Abstract

This paper presents an implemented system for recognizing the occurrence of events described 
by simple spatial-motion verbs in short image sequences. The semantics of these verbs is specified 
with event-logic expressions that describe changes in the state of force-dynamic relations between 
the participants of the event. An efficient finite representation is introduced for the infinite sets 
of intervals that occur when describing liquid and semi-liquid events. Additionally, an efficient
procedure using this representation is presented for inferring occurrences of compound events, de- 
scribed with event-logic expressions, from occurrences of primitive events. Using force dynamics 
and event logic to specify the lexical semantics of events allows the system to be more robust than 
prior systems based on motion profile. 

1. Introduction 

If one were to look at the image sequence in Figure 1(a), one could describe the event depicted 
in that sequence by saying Someone picked the red block up off the green block. Similarly, if 
one were to look at the image sequence in Figure 1(b), one could describe the event depicted in 
that sequence by saying Someone put the red block down on the green block. One way that one 
recognizes that the former is a pick up¹ event is that one notices a state change in the force-dynamic
relations between the participant objects. Prior to Frame 13, the red block is supported by the 
green block by a substantiality constraint, the fact that solid objects cannot interpenetrate (Spelke, 
1983; Baillargeon, Spelke, & Wasserman, 1985; Baillargeon, 1986, 1987; Spelke, 1987, 1988). 
From Frame 13 onward, it is supported by being attached to the hand. Similarly, one way that one 
recognizes that the latter is a put down event is that one notices the reverse state change in Frame 14. 
This paper describes an implemented computer system, called Leonard, that can produce similar 
event descriptions from such image sequences. A novel aspect of this system is that it produces event 
descriptions by recognizing state changes in force-dynamic relations between participant objects. 
Force dynamics is a term introduced by Talmy (1988) to describe a variety of causal relations 
between participants in an event, such as allowing, preventing, and forcing. In this paper, I use
force dynamics in a slightly different sense, namely, to describe the support, contact, and attachment 
relations between participant objects. 

A number of systems have been reported that can produce event descriptions from video input. 
Examples of such systems include the work reported in Yamoto, Ohya, and Ishii (1992), Starner 
(1995), Siskind and Morris (1996), Brand (1996), and Bobick and Ivanov (1998). Leonard differs

1. Throughout this paper, I treat verb-particle constructions, like pick up, as atomic verbs, despite the fact that the verb
and its associated particle may be discontinuous. Methods for deriving the semantics of a verb-particle construction
compositionally from its (discontinuous) constituents are beyond the scope of this paper.
Figure 1: Image sequences depicting (a) a pick up event and (b) a put down event.
from these prior systems in two crucial ways. First, the prior systems classify events based on the
motion profile of the participant objects. For example, Siskind and Morris (1996) characterize a pick 
up event as a sequence of two subevents: the agent moving towards the patient while the patient is 
at rest above the source, followed by the agent moving with the patient away from the source while 
the source remains at rest. Similarly, a put down event is characterized as the agent moving with 
the patient towards the destination while the destination is at rest, followed by the agent moving 
away from the patient while the patient is at rest above the destination. In contrast, Leonard 
characterizes events as changes in the force-dynamic relations between the participant objects. For 
example, a pick up event is characterized as a change from a state where the patient is supported by 
a substantiality constraint with the source to a state where the patient is supported by being attached 
to the agent. Similarly, a put down event is characterized as the reverse state change. Irrespective 
of whether motion profile or force dynamics is used to recognize events, event recognition is a 
process of classifying time-series data. In the case of motion profile, this time-series data takes the 
form of relative-and-absolute positions, velocities, and accelerations of the participant objects as a 
function of time. In the case of force dynamics, this time-series data takes the form of the truth 
values of force-dynamic relations between the participant objects as a function of time. This leads 
to the second difference between Leonard and prior systems. The prior systems use stochastic 
reasoning, in the form of hidden Markov models, to classify the time-series data into event types. 
In contrast, Leonard uses logical reasoning, in the form of event logic, to do this classification.

Using force dynamics and event logic (henceforth the 'new approach') to recognize events of- 
fers several advantages over using motion profile and hidden Markov models (henceforth the 'prior 
approach'). First, the new approach will correctly recognize an event despite a wider variance in 
motion profile than the prior approach. For example, when recognizing, say, a pick up event, the 
prior approach is sensitive to aspects of event execution, like the approach angle and velocity of the 
hand, that are irrelevant to whether or not the event is actually a pick up. The new approach is not 
sensitive to such aspects of event execution. Second, the new approach will correctly recognize an 
event despite the presence of unrelated objects in the field of view. The prior approach computes 
the relative-and-absolute positions and motions of all objects and pairs of objects in the field of 
view. It then selects the subset of objects whose positions and motions best matched some model. 
This could produce incorrect descriptions when some unintended subset matched some unintended 
model better than the intended subset matched the intended model. The new approach does not 
exhibit such deficiencies. Extraneous objects typically do not exhibit the precise sequence of state 
changes in force-dynamic relations needed to trigger the event-classification process and thus will 
not generate spurious claims of event occurrences. Third, the new approach performs temporal and 
spatial segmentation of events. The prior approach matches an entire image sequence against an 
event model. It fails if that image sequence depicts multiple event executions, either in sequence 
or in parallel. In contrast, the new approach can segment a complex image sequence into a collection
of sequential and/or overlapping events. In particular, it can handle hierarchical events, such as
move, that consist of a pick up event followed by a put down event. It can recognize that all three 
events, and precisely those three events, occur in an appropriate image sequence whereas the prior 
approach would try to find the single best match. Finally, the new approach robustly detects the 
non-occurrence of events as well as the occurrence of events. The prior approach always selects the 
best match and reports some event occurrence for every image sequence. Thresholding the match 
cost does not work because an approach based on motion profile can be fooled into triggering recog- 
nition of an event occurrence by an event whose motion profile is similar to one or more target event 
Figure 2: Image sequences depicting non-events.



classes even though that event is not actually in any of those target event classes. Consider, for 
example, the two image sequences in Figure 2. Suppose that an event-recognition system contained 
two target event classes, namely pick up and put down. Neither of the image sequences depicts pick
up or put down events. Nonetheless, the prior approach might mistakenly classify Figure 2(a) as a
pick up event because the second half of this image sequence matches the second half of the motion
profile of a pick up event. Alternatively, it might mistakenly classify this image sequence as a put
down event because the first half of this image sequence matches the first half of the motion profile
of a put down event. Similarly, the prior approach might mistakenly classify Figure 2(b) as a pick
up event because the first half of this image sequence matches the first half of the motion profile
of a pick up event. Alternatively, it might mistakenly classify this image sequence as a put down
event because the second half of this image sequence matches the second half of the motion profile
of a put down event. In contrast, the new approach correctly recognizes that neither of these image
sequences exhibits the necessary state changes in force-dynamic relations to qualify as either pick up
or put down events. All four of these advantages will be discussed in greater detail in Section 5. 
The techniques described in this paper have been implemented in a system called Leonard.
Leonard is a comprehensive system that takes image sequences as input and produces event de- 
scriptions as output. The overall architecture of Leonard is shown in Figure 6. The input to 
Leonard consists of a sequence of images taken by a Canon VC-C3 camera and Matrox Meteor
frame grabber at 320×240 resolution at 30 fps. This image sequence is first processed by a
segmentation-and-tracking component. A real-time colour- and motion-based segmentation algo- 
rithm places a convex polygon around each coloured and moving object in each frame. A tracking 
algorithm then forms a correspondence between the polygons in each frame and those in temporally 
adjacent frames. The output of the segmentation-and-tracking component consists of a sequence of 
scenes, each scene being a set of polygons. Each polygon is represented as a sequence of image 
coordinates corresponding to a clockwise traversal of the polygon's vertices. The tracker guar- 
antees that each scene contains the same number of polygons and that they are ordered so that 
the i-th polygon in each scene corresponds to the same object. Figure 3 shows the output of the
segmentation-and-tracking component on the image sequences from Figure 1. The polygons have 
been overlayed on the input images for ease of comprehension. 
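
The scene representation described above can be illustrated with a small Python sketch. It is not taken from Leonard: the type names and helper functions are hypothetical, and it assumes image coordinates in which y grows downward, so that a positive shoelace sum corresponds to a clockwise traversal as seen on screen.

```python
from typing import List, Tuple

Point = Tuple[float, float]      # image coordinates; y grows downward (assumption)
Polygon = List[Point]            # vertices in traversal order
Scene = List[Polygon]            # index i refers to the same object in every scene


def signed_area(poly: Polygon) -> float:
    """Shoelace formula. With y growing downward, a positive result means
    the vertices are listed clockwise as seen on screen."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(poly, poly[1:] + poly[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2.0


def is_clockwise_on_screen(poly: Polygon) -> bool:
    return signed_area(poly) > 0


def is_convex(poly: Polygon) -> bool:
    """True if all turns have the same sign (collinear triples are ignored).
    Assumes a simple, non-self-intersecting polygon."""
    n = len(poly)
    signs = set()
    for k in range(n):
        (ax, ay), (bx, by), (cx, cy) = poly[k], poly[(k + 1) % n], poly[(k + 2) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross != 0:
            signs.add(cross > 0)
    return len(signs) <= 1


def same_polygon_count(scenes: List[Scene]) -> bool:
    """The tracker's guarantee that every scene has the same number of polygons."""
    return len({len(scene) for scene in scenes}) <= 1


square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
dented = [(0.0, 0.0), (10.0, 0.0), (5.0, 2.0), (10.0, 10.0), (0.0, 10.0)]
scenes = [[square, square], [square, square]]
print(is_clockwise_on_screen(square), is_convex(square), is_convex(dented))
```

Keeping polygons in a fixed order across scenes means that an object can be referred to simply by its index, which is the property the later components rely on.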

This scene sequence is passed to a model-reconstruction component. This component produces 
a force-dynamic model of each scene. This model specifies three types of information: which 
objects are grounded, i.e. are supported by an unseen mechanism that is not associated with any vis- 
ible object, which objects are attached to other objects by rigid or revolute joints, and the qualitative 
depth of each object, i.e. a qualitative representation of the relative distance of different objects in 
the field of view from the observer, in the form of a same layer relation specifying which objects 
are at the same qualitative depth. Figure 4 shows the output of the model-reconstruction component 
on the scene sequences from Figure 3. The models are depicted graphically, overlayed on the input 
images, for ease of comprehension. The details of this depiction scheme will be described momen- 
tarily. For now, it suffices to point out that Figure 4(a) shows the red block on the same layer as the 
green block up through Frame 1 and attached to the hand from Frame 14 onward. Figure 4(b) shows 
the reverse sequence of relations, with the red block attached to the hand up through Frame 13 and 
on the same layer as the green block from Frame 23 onward. 

This model sequence is passed to an event-classification component. This component first de- 
termines the intervals over which certain primitive event types are true. These primitive event types 
include Supported(x), Supports(x, y), Contacts(x, y), and Attached(x, y). This component
then uses an inference procedure to determine the intervals over which certain compound
event types are true. These compound event types include PickUp(x, y, z), PutDown(x, y, z),
Stack(w, x, y, z), Unstack(w, x, y, z), Move(w, x, y, z), Assemble(w, x, y, z), and
Disassemble(w, x, y, z) and are specified as expressions in event logic over the primitive event
types. The output of the event-classification component consists of an indication of which com- 
pound event types occurred in the input movie as well as the subsequence(s) of frames during which 
those event types occurred. Figure 5 shows the output of the event-classification component on the 
model sequences from Figure 4. The subsequences of frames during which the events occurred are 
depicted as spanning intervals. Spanning intervals will be described in Section 4.1. 

Leonard is too complex to describe completely in one paper. This paper provides a detailed 
description of the event-classification component and, in particular, the event-logic inference pro- 
cedure. The segmentation and tracking algorithms are extensions of the algorithms presented in 
Siskind and Morris (1996) and Siskind (1999), modified to place convex polygons around the par- 
ticipant objects instead of ellipses. The model-reconstruction techniques are extensions of those 
Figure 3: The output of the segmentation-and-tracking component applied to the image sequences
from Figure 1. (a) depicts a pick up event, (b) depicts a put down event. The polygons
have been overlayed on the input images for ease of comprehension.






Grounding the Lexical Semantics of Verbs 




Figure 4: The output of the model-reconstruction component applied to the scene sequences from
Figure 3. (a) depicts a pick up event, (b) depicts a put down event. The models have
been overlayed on the input images for ease of comprehension. In (a), the red block is
on the same layer as the green block up through Frame 1 and is attached to the hand
from Frame 14 onward. In (b), the reverse sequence of relations holds, with the red block
attached to the hand up through Frame 13 and on the same layer as the green block from
Frame 23 onward. [Panel labels: Frames 1, 2, 13, 14, 20; Frames 1, 13, 14, 22, 23, 27.]
Figure 5: The output of the event-classification component applied to the model sequences from
Figure 4. Note that the pick up event is correctly recognized in (a) and the put down event
is correctly recognized in (b).

Figure 6: The overall architecture of Leonard.



presented in Siskind (1997, 2000). The model-reconstruction techniques will be described briefly 
below to allow the reader to understand the event-classification techniques without reference to 
those papers. 

2. Model Reconstruction 

Certain properties of objects are visible, for example, position, orientation, shape, size, colour,
texture, and so forth. Furthermore, relational variants of these properties are also visible, as well 
as changes in such properties and relations over time. In contrast, force-dynamic properties and 
relations are not visible. One cannot see the fact that the door knob is attached to, and supported by, 
the door. One must infer that fact using physical knowledge of the world. Such knowledge includes 
the fact that unsupported objects fall and attachment is one way of offering support. Using physi- 
cal knowledge to infer force-dynamic properties and relations was first discussed by Siskind (1991, 
1992, 1993). This later became known as the perceiver framework advanced by Jepson and Richards 
(1993). The perceiver framework states that perception involves four levels. First, one must specify
the observables, what properties and relations can be discerned by direct observation. Second,
one must specify an ontology, what properties and relations must be inferred from the observables. 
Descriptions of the observables in terms of such properties and relations are called interpretations. 
There may be multiple interpretations of a given observation. Third, one must specify a theory, 
a way of differentiating consistent interpretations from inconsistent ones. The consistent interpre- 
tations are the models of the observation. There may be multiple models of a given observation. 
Finally, one must specify a preference relation, a way of ordering the models. The most-preferred 
models of the observations are the percepts. One can instantiate the perceiver framework for dif- 
ferent observables, ontologies, theories, and preference relations. Siskind (1991, 1992, 1993, 1994, 
1995, 1997) instantiated this framework for a kinematic theory applied to simulated video. Mann, 
Jepson, and Siskind (1996, 1997) and Mann and Jepson (1998) instantiated this framework for a 
dynamics theory applied to real video. Siskind (2000) instantiated this framework for a kinematic 
theory applied to real video. This paper uses this latter approach.

The input to the model-reconstruction process consists of a sequence of scenes, each scene be- 
ing a set of convex polygons. Each polygon is represented as a sequence of points corresponding 
to a clockwise traversal of the polygon's vertices. The tracker guarantees that each scene contains 
the same number of polygons and that they are ordered so that the i-th polygon in each scene corresponds
to the same object. The output of the model-reconstruction process consists of a sequence of
interpretations, one interpretation per scene. The interpretations are formulated out of the following 
primitive properties of, and relations between, the objects in each scene. 

Grounded(p) Polygon p is grounded. It is constrained to occupy a fixed position and orientation
by an unseen mechanism that is not associated with any visible object and thus cannot move 
either translationally or rotationally. 
Figure 7: The graphical method for depicting interpretations that is used in this paper. The symbol
'i' indicates that a polygon is grounded. A solid circle indicates a rigid joint. A hollow 
circle indicates a revolute joint. Two polygons with the same layer index are on the same 
layer. 



Rigid(p, q, r) Polygons p and q are attached by a rigid joint at point r. Both the relative position 
and orientation of p and q are constrained. 

Revolute(p, q, r) Polygons p and q are attached by a revolute joint at point r. The relative position
of p and q is constrained but the relative orientation is not.

SameLayer(p, q) Polygons p and q are on the same layer. Layers are a qualitative representation
of depth, or distance from the observer. This representation is impoverished. There is no
notion of 'in-front-of' or 'behind' and there is no notion of adjacency in depth. The only
representable notion is whether two objects are on the same or different layers. The same-layer
relation is constrained to be an equivalence relation, i.e. it must be reflexive, symmetric,
and transitive. Furthermore, two objects on the same layer must obey the substantiality constraint,
the constraint that they not interpenetrate (Spelke, 1983; Baillargeon et al., 1985;
Baillargeon, 1986, 1987; Spelke, 1987, 1988).

An interpretation I is a 4-tuple: (Grounded, Rigid, Revolute, SameLayer). Throughout
this paper, interpretations will be depicted graphically, overlayed on scene images, for ease of com- 
prehension. Figure 7 gives a sample interpretation depicted graphically. The symbol attached 
to a polygon indicates that it is grounded. A solid circle indicates that two polygons are rigidly 
attached at the center of the circle. A hollow circle indicates that two polygons are attached by a 
revolute joint at the center of the circle. The same-layer relation is indicated by giving a layer index, 
a small nonnegative integer, to each polygon. Polygons with the same layer index are on the same 
layer, while those with different layer indices are on different layers. 
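
One way to hold such an interpretation in a program is sketched below. This is only an illustration under stated assumptions: the class name, field names, and the choice to identify polygons by their index in the scene are all hypothetical. Storing a layer index per polygon, as in the graphical depiction, makes the same-layer relation an equivalence relation by construction; the helper is_equivalence checks this explicitly.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

Point = Tuple[float, float]
Joint = Tuple[int, int, Point]   # (p, q, r): polygons p and q joined at point r


@dataclass(frozen=True)
class Interpretation:
    """Hypothetical container for the 4-tuple (Grounded, Rigid, Revolute, SameLayer)."""
    grounded: FrozenSet[int]
    rigid: FrozenSet[Joint]
    revolute: FrozenSet[Joint]
    layer: Tuple[int, ...]       # layer index of each polygon

    def same_layer(self, p: int, q: int) -> bool:
        return self.layer[p] == self.layer[q]


def is_equivalence(related: Callable[[int, int], bool], n: int) -> bool:
    """Check reflexivity, symmetry, and transitivity over polygons 0..n-1."""
    idx = range(n)
    reflexive = all(related(p, p) for p in idx)
    symmetric = all(related(q, p) for p in idx for q in idx if related(p, q))
    transitive = all(related(p, r) for p in idx for q in idx for r in idx
                     if related(p, q) and related(q, r))
    return reflexive and symmetric and transitive


# Three polygons: 0 and 1 share a layer and a rigid joint, 2 is grounded on another layer.
interp = Interpretation(
    grounded=frozenset({2}),
    rigid=frozenset({(0, 1, (5.0, 5.0))}),
    revolute=frozenset(),
    layer=(0, 0, 1),
)
print(interp.same_layer(0, 1), interp.same_layer(0, 2))
print(is_equivalence(interp.same_layer, 3))
```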

Model reconstruction can be viewed as a generate-and-test process. Initially, all possible inter- 
pretations are generated for each scene. Then, inadmissible and unstable interpretations are filtered 
out. Admissibility and stability can be collectively viewed as a consistency requirement. The sta- 
ble admissible interpretations are thus models of a scene. The nature of the theory guarantees that 
there will always be at least one model for each scene, namely the model where all objects are 
grounded. There may, however, be multiple models for a given scene. Therefore, a preference relation
is then applied through a sequence of circumscription processes (McCarthy, 1980) to select
the minimal, or preferred, models for each scene. While there will always be at least one mini- 
mal model for each scene, there may be several, since the preference relation may not induce a 
total order. If there are multiple minimal models for a given scene, one is chosen arbitrarily as the 
most-preferred model for that scene. The precise details of the admissibility criteria, the stability 
checking algorithm, the preference relations, and the circumscription process are beyond the scope 
of this paper. They are discussed in Siskind (2000). What is important, for the purpose of this paper, 
is that, given a scene sequence, model reconstruction produces a sequence of interpretations, one 
for each scene, and that these interpretations are 4-tuples containing the predicates Grounded,
Rigid, Revolute, and SameLayer. Figure 4 shows sample interpretation sequences produced
by the model-reconstruction component on the scene sequences from Figure 3. 
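
The generate-and-test view of model reconstruction can be sketched generically as follows. This is a toy illustration, not the procedure of Siskind (2000): the consistency test and the preference relation used here (prefer strictly fewer grounded objects) are hypothetical stand-ins, interpretations are reduced to their set of grounded objects, and a single filtering pass replaces the sequence of circumscription processes.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Candidate = FrozenSet[str]   # toy interpretation: just the set of grounded objects


def power_set(items: Iterable[str]) -> List[Candidate]:
    ordered = sorted(items)
    return [frozenset(c) for r in range(len(ordered) + 1)
            for c in combinations(ordered, r)]


def minimal_models(candidates: List[Candidate],
                   consistent: Callable[[Candidate], bool],
                   strictly_preferred: Callable[[Candidate, Candidate], bool]
                   ) -> List[Candidate]:
    """Keep the consistent candidates (the models), then keep those models
    to which no other model is strictly preferred."""
    models = [c for c in candidates if consistent(c)]
    return [m for m in models
            if not any(strictly_preferred(other, m) for other in models)]


# Hypothetical toy theory: a scene is consistent if something is grounded.
candidates = power_set({"hand", "table"})
minimal = minimal_models(
    candidates,
    consistent=lambda grounded: len(grounded) > 0,
    strictly_preferred=lambda a, b: a < b,   # proper subset: fewer grounded objects
)
chosen = minimal[0]   # several minimal models may exist; one is picked arbitrarily
print(minimal, chosen)
```

In this toy run the all-grounded candidate is a model but not a minimal one, and two incomparable minimal models remain, which mirrors the observation that the preference relation need not induce a total order.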

3. Event Logic 

Model reconstruction determines the truth values of the force-dynamic relations on a frame-by- 
frame basis in the input movie. Intervals of constant truth value for a given force-dynamic rela- 
tion are taken to be primitive event occurrences. Leonard uses event logic to infer compound 
event occurrences from primitive event occurrences. For example, for the image sequence in Figure 1(a),
model reconstruction determines that the green block supports the red block from Frame 0
to Frame 13 and that the hand is attached to the red block from Frame 13 to Frame 20. This will be
denoted as Supports(green-block, red-block)@[0, 13) and
Attached(hand, red-block)@[13, 20), i.e. that the primitive event types
Supports(green-block, red-block) and Attached(hand, red-block) occurred during the intervals
[0, 13) and [13, 20) respectively. The compound event type
PickUp(hand, red-block, green-block) might be defined as

Supports(green-block, red-block) ; Attached(hand, red-block)

i.e. Supports(green-block, red-block) followed by Attached(hand, red-block). (In the above,
I use ';' informally as a sequence operator. The precise definition of ';' will be given momentarily.)
The task of the event-logic inference procedure is to infer
PickUp(hand, red-block, green-block)@[0, 20), i.e. that the compound event type
PickUp(hand, red-block, green-block) occurred during the interval [0, 20).
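
The example above can be traced with a short Python sketch. It is illustrative only and is not Leonard's inference procedure: it assumes frames are numbered from 0, represents the interval [a, b) as a pair of integers, tracks only maximal intervals, and uses hypothetical helper names.

```python
from typing import List, Tuple

Interval = Tuple[int, int]   # half-open [start, end) over frame indices (assumption)


def maximal_true_intervals(truth: List[bool]) -> List[Interval]:
    """Maximal runs of frames over which a frame-by-frame relation is true."""
    intervals = []
    start = None
    for t, value in enumerate(truth):
        if value and start is None:
            start = t
        elif not value and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(truth)))
    return intervals


def sequence(first: List[Interval], second: List[Interval]) -> List[Interval]:
    """Informal ';': an occurrence of `first` immediately followed by one of `second`."""
    return [(a, d) for (a, b) in first for (c, d) in second if b == c]


# Frame-by-frame truth values matching the example in the text.
supports = [t < 13 for t in range(20)]     # Supports(green-block, red-block)
attached = [t >= 13 for t in range(20)]    # Attached(hand, red-block)

supports_at = maximal_true_intervals(supports)
attached_at = maximal_true_intervals(attached)
pick_up_at = sequence(supports_at, attached_at)
print(supports_at, attached_at, pick_up_at)   # [(0, 13)] [(13, 20)] [(0, 20)]
```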

Event logic provides a calculus for forming compound event types as expressions over primitive 
event types. The syntax and semantics of event logic will be described momentarily. Event-logic 
expressions denote event types, not event occurrences. As such, they do not have truth values. 
Rather, they are predicates that describe the truth conditions that must hold of an interval for an 
event to occur. In contrast, an event-occurrence formula does have a truth value. If $\Phi$ is an event-logic
expression that denotes a primitive or compound event type, and i is an interval, then $\Phi@i$
is an atomic event-occurrence formula that is true if and only if the truth conditions for the event
type $\Phi$ hold of the interval i.

$\Phi@i$ denotes coincidental occurrence, the fact that an occurrence of $\Phi$ started at the beginning
of i and finished at the end of i. $\Phi@i$ would not hold if an occurrence of $\Phi$ did not precisely coincide
with i, but instead overlapped,² partially or totally, with i. Event types have internal temporal
structure that renders this distinction important. In the case of primitive event types, that structure is
simple. Each primitive event type is derived from a predicate. If $\phi$ is a predicate, then $\overline{\phi}$ denotes the
primitive event type derived from $\phi$. A primitive event type $\overline{\phi}$ holds of an interval if the corresponding
predicate $\phi$ holds of every instant in that interval.³ This means that $\neg(\overline{\phi}@i)$ and $\overline{\neg\phi}@i$ might
have different truth values. For example, if $\phi$ is true of every instant in [0, 2) and false of every other
instant, then $\neg(\overline{\phi}@[1, 3))$ is true while $\overline{\neg\phi}@[1, 3)$ is false. Event logic takes coincidental occurrence
to be a primitive notion. As will be demonstrated below, overlapping occurrence is a derived notion
that can be expressed in terms of coincidental occurrence using compound event-logic expressions.
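
The difference between negating an occurrence and deriving an event type from a negated predicate can be checked on a discretized version of the example, using [1, 3) as the test interval. The sketch below is an illustration under assumptions: instants are replaced by integer frames, the function names are hypothetical, and the majority-vote filter follows the description in footnote 3 with one added assumption, namely that the five-frame window is truncated at the ends of the sequence and that ties count as false.

```python
from typing import List, Tuple


def holds_throughout(pred: List[bool], interval: Tuple[int, int]) -> bool:
    """Primitive event type derived from a predicate: the predicate must be
    true at every frame of the half-open interval [start, end)."""
    start, end = interval
    return all(pred[t] for t in range(start, end))


def majority_filter(pred: List[bool], width: int = 5) -> List[bool]:
    """Majority vote over a window centered on each frame (truncated at the ends)."""
    half = width // 2
    filtered = []
    for t in range(len(pred)):
        window = pred[max(0, t - half): t + half + 1]
        filtered.append(2 * sum(window) > len(window))
    return filtered


phi = [True, True, False, False]        # true on [0, 2), false afterwards
not_phi = [not value for value in phi]
i = (1, 3)

negated_occurrence = not holds_throughout(phi, i)    # not (phi-bar @ [1, 3))
occurrence_of_negation = holds_throughout(not_phi, i)  # (not phi)-bar @ [1, 3)
print(negated_occurrence, occurrence_of_negation)    # True False

noisy = [True, True, False, True, True, True]        # one-frame dropout
print(majority_filter(noisy))
```

The single false frame in the noisy sequence is voted away, so the filtered predicate yields one long interval instead of two short ones.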

Two auxiliary notions are needed to define the syntax and semantics of event logic. First, there 
are thirteen possible relations between two intervals. Following Allen (1983), I denote these rela- 
tions as =, <, >, m, mi, o, oi, s, si, f, fi, d, and di and refer to them collectively as Allen relations 
throughout the paper. The names m, o, s, f, and d are mnemonics for meet, overlap, start, finish, 
and during respectively. The inverse relations, such as mi, whose names end in i are the same as 
the corresponding relations, such as m, whose names do not end in i except that the arguments are 
reversed. Figure 8 depicts all thirteen Allen relations graphically. Second, I define the span of two 
intervals i and j, denoted SPAN(i, j), as the smallest super-interval of both i and j. 
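
A compact way to compute these two notions is sketched below. It is a simplification rather than the paper's machinery: intervals are pairs (start, end) with start < end, and the distinction between open and closed endpoints is ignored. The function names allen and span are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]   # (start, end) with start < end; endpoint openness ignored

INVERSE = {"=": "=", "<": ">", ">": "<", "m": "mi", "mi": "m", "o": "oi", "oi": "o",
           "s": "si", "si": "s", "f": "fi", "fi": "f", "d": "di", "di": "d"}


def allen(i: Interval, j: Interval) -> str:
    """Return the name of the unique Allen relation r such that i r j."""
    (a, b), (c, d) = i, j
    if a == c and b == d:
        return "="
    if b < c:
        return "<"
    if d < a:
        return ">"
    if b == c:
        return "m"
    if d == a:
        return "mi"
    if a == c:
        return "s" if b < d else "si"
    if b == d:
        return "f" if a > c else "fi"
    if c < a and b < d:
        return "d"
    if a < c and d < b:
        return "di"
    return "o" if a < c else "oi"


def span(i: Interval, j: Interval) -> Interval:
    """Smallest super-interval of both i and j."""
    return (min(i[0], j[0]), max(i[1], j[1]))


print(allen((0, 13), (13, 20)), span((0, 13), (13, 20)))   # m (0, 20)
```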

The syntax of event logic is defined as follows. We are given finite disjoint sets of constant 
symbols along with a finite set of primitive event-type symbols, each of a specified arity. Constant 
symbols, such as red-block and hand, denote objects in the input movie while primitive event-type 
symbols, such as Supports, denote parameterized primitive event types. An atomic event-logic
expression is a primitive event-type symbol of arity n applied to a sequence of n constants, for
example, Supports(hand, red-block). An event-logic expression is either an atomic event-logic
expression or one of the compound event-logic expressions $\neg\Phi$, $\Phi \vee \Psi$, $\Phi \wedge_R \Psi$, or $\Diamond_R \Phi$, where $\Phi$
and $\Psi$ are event-logic expressions and $R \subseteq \{=, <, >, m, mi, o, oi, s, si, f, fi, d, di\}$.

Informally, the semantics of compound event-logic expressions is defined as follows: 

• $\neg\Phi$ denotes the non-occurrence of $\Phi$. An occurrence of $\neg\Phi$ coincides with i if no occurrence
of $\Phi$ coincides with i. Note that $(\neg\Phi)@i$ could be true, even if an occurrence of $\Phi$ overlapped
with i, so long as no occurrence of $\Phi$ coincided with i.

• $\Phi \vee \Psi$ denotes the occurrence of either $\Phi$ or $\Psi$.

• $\Phi \wedge_R \Psi$ denotes the occurrence of both $\Phi$ and $\Psi$. The occurrences of $\Phi$ and $\Psi$ need not
be simultaneous. The subscript R specifies a set of allowed Allen relations between the
occurrences of $\Phi$ and $\Psi$. If occurrences of $\Phi$ and $\Psi$ coincide with i and j respectively, and
i r j for some $r \in R$, then an occurrence of $\Phi \wedge_R \Psi$ coincides with the span of i and j. I
abbreviate the special case $\Phi \wedge_{\{=\}} \Psi$ simply as $\Phi \wedge \Psi$ without any subscript. $\Phi \wedge \Psi$ describes
an aggregate event where both $\Phi$ and $\Psi$ occur simultaneously. I also abbreviate the special
case $\Phi \wedge_{\{m\}} \Psi$ as $\Phi ; \Psi$. $\Phi ; \Psi$ describes an aggregate event where an occurrence of $\Phi$ is
immediately followed by an occurrence of $\Psi$.

2. I am using the term 'overlap' here in a different sense than the o relation from Allen (1983). Here, I am using the 
term overlap in the sense that two intervals overlap if they have a nonempty intersection. This corresponds to the 
union of the o, oi, s, si, f, fi, d, and di relations from Allen (1983). 

3. To deal with noise, in the actual implementation, the primitive event type $\overline{\phi}$ is derived from the predicate $\phi$ by first
passing $\phi$ through a low-pass filter that takes $\phi(t)$ to be the majority vote of a five-frame window centered on t.
Figure 8: The Allen relations, the thirteen possible relations between two intervals.
• An occurrence of $\Diamond_R \Phi$ coinciding with i denotes an occurrence of $\Phi$ at some other interval
j such that j r i for some $r \in R$. $\Diamond_R$ can act as a tense operator. Expressions such as
$\Diamond_{\{<\}}\Phi$, $\Diamond_{\{>\}}\Phi$, $\Diamond_{\{m\}}\Phi$, and $\Diamond_{\{mi\}}\Phi$ specify that $\Phi$ happened in the noncontiguous past,
noncontiguous future, contiguous past, or contiguous future respectively. The $\Diamond_R$ operator
can also be used to derive overlapped occurrence from coincidental occurrence. An occurrence
of $\Diamond_{\{=,o,oi,s,si,f,fi,d,di\}}\Phi$ coincides with i if an occurrence of $\Phi$ overlaps with i. I abbreviate
$\Diamond_{\{=,o,oi,s,si,f,fi,d,di\}}\Phi$ simply as $\Diamond\Phi$ without any subscript. Note that while $(\neg\Phi)@i$
indicates that no occurrence of $\Phi$ coincided with i, $(\neg\Diamond\Phi)@i$ indicates that no occurrence
of $\Phi$ overlapped with i.

Formally, the truth of an atomic event-occurrence formula $\Phi@i$ is defined relative to a model.
Let $I$ be the set of all intervals and $O$ be the set of all objects in the movie. A model $M$ is a map
from primitive event-type symbols of arity $n$ to subsets of $I \times \underbrace{O \times \cdots \times O}_{n}$. ($M$ can be viewed as
either a model or as a movie.) $M$ thus maps primitive event-type symbols to relations that take an
interval as their first argument, in addition to the remaining object parameters. The semantics of
event logic is formally defined by specifying an entailment relation $M \models \Phi@i$ as follows:

• $M \models p(t_1, \ldots, t_n)@i$ if and only if $(i, t_1, \ldots, t_n) \in M(p)$.

• $M \models (\neg\Phi)@i$ if and only if $M \not\models \Phi@i$.

• $M \models (\Phi \vee \Psi)@i$ if and only if $M \models \Phi@i$ or $M \models \Psi@i$.

• $M \models (\Phi \wedge_R \Psi)@i$ if and only if there exist two intervals j and k such that i = SPAN(j, k),
j r k for some $r \in R$, $M \models \Phi@j$, and $M \models \Psi@k$.

• $M \models (\Diamond_R \Phi)@i$ if and only if there exists some interval j such that j r i for some $r \in R$ and
$M \models \Phi@j$.
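
The entailment relation can be prototyped directly from these clauses by brute force. The sketch below is a toy evaluator under strong assumptions and is not the efficient inference procedure this paper develops: the set of intervals is restricted to pairs of integer endpoints between 0 and T, endpoint openness is ignored, expressions are encoded as nested tuples, and all function names are hypothetical.

```python
from itertools import combinations

OVERLAPS = frozenset({"=", "o", "oi", "s", "si", "f", "fi", "d", "di"})


def allen(i, j):
    """Name of the Allen relation r with i r j (endpoint openness ignored)."""
    (a, b), (c, d) = i, j
    if a == c and b == d: return "="
    if b < c: return "<"
    if d < a: return ">"
    if b == c: return "m"
    if d == a: return "mi"
    if a == c: return "s" if b < d else "si"
    if b == d: return "f" if a > c else "fi"
    if c < a and b < d: return "d"
    if a < c and d < b: return "di"
    return "o" if a < c else "oi"


def span(i, j):
    return (min(i[0], j[0]), max(i[1], j[1]))


def intervals(T):
    """Finite stand-in for the set of all intervals: integer endpoints in 0..T."""
    return list(combinations(range(T + 1), 2))


def holds(M, e, i, T):
    """Does M entail e@i? One branch per clause of the formal semantics."""
    kind = e[0]
    if kind == "atom":
        _, p, args = e
        return (i, *args) in M.get(p, set())
    if kind == "not":
        return not holds(M, e[1], i, T)
    if kind == "or":
        return holds(M, e[1], i, T) or holds(M, e[2], i, T)
    if kind == "and":
        _, R, f, g = e
        return any(span(j, k) == i and allen(j, k) in R
                   and holds(M, f, j, T) and holds(M, g, k, T)
                   for j in intervals(T) for k in intervals(T))
    if kind == "dia":
        _, R, f = e
        return any(allen(j, i) in R and holds(M, f, j, T) for j in intervals(T))
    raise ValueError(f"unknown expression kind: {kind}")


def liquid(start, end, *args):
    """A primitive event that holds of every sub-interval of [start, end)."""
    return {((a, b), *args) for a in range(start, end) for b in range(a + 1, end + 1)}


T = 20
M = {"Supports": liquid(0, 13, "green-block", "red-block"),
     "Attached": liquid(13, 20, "hand", "red-block")}
supports = ("atom", "Supports", ("green-block", "red-block"))
attached = ("atom", "Attached", ("hand", "red-block"))
pick_up = ("and", {"m"}, supports, attached)          # Supports ; Attached

print(holds(M, pick_up, (0, 20), T))                  # True
print(holds(M, pick_up, (0, 13), T))                  # False
print(holds(M, ("dia", OVERLAPS, supports), (10, 15), T))   # True
```

Enumerating every pair of intervals is quadratic in the number of intervals for each conjunction, which is tolerable only for toy inputs like this one.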

Figure 9 shows the primitive event types currently used by Leonard. The definitions in Figure 9
are formulated in terms of the predicates Grounded, Rigid, Revolute, and SameLayer
that are produced by model reconstruction as described in Section 2. An object is supported if it is
not grounded. Two objects contact if their polygons touch and they are on the same layer. Two polygons
p and q touch, denoted Touches(p, q), if they intersect and their intersection has zero area.
Two objects are attached if there is a rigid or revolute joint that joins them. Determining whether
an object x supports an object y is a little more complex and requires counterfactual reasoning.
Let $P$ be the set of polygons in a scene, $I$ = (Grounded, Rigid, Revolute, SameLayer) be
the most-preferred model of the scene as produced by model reconstruction, and Stable($P$, $I$) be
true if the scene $P$ is stable under an interpretation $I$. Stability analysis, i.e. the Stable predicate,
is a subroutine used by the model-reconstruction component. An object x supports an object y if y
is not grounded in the most-preferred model $I$ and a variant of $P$ with x removed is not stable under
a variant of $I$ where all objects except for y and those rigidly attached, directly or indirectly, to y are
grounded. In Figure 9, $P \setminus \{x\}$ denotes the variant of $P$ with x removed, RigAtt(x, y) denotes the
fact that x is rigidly attached to y, RigAtt* denotes the reflexive transitive closure of the RigAtt
relation, $\{z \mid \neg\mathrm{RigAtt}^{*}(z, y)\}$ denotes the set of objects that are not rigidly attached, directly or
indirectly, to y, and $(\mathrm{Grounded} \cup \{z \mid \neg\mathrm{RigAtt}^{*}(z, y)\}, \mathrm{Rigid}, \mathrm{Revolute}, \mathrm{SameLayer})$ denotes
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A 



x = y 



x = y 



Supported (a;) 
RlGAn{x,y) 



A 



-■Grounded (a;) 

(3r)RlGID(;r. y. r) 



A 



-■Grounded (y) A 



Grounded U {2;|-^RiGATT*(z,y)}, \ 
Rigid, \ 

Re VOLUTE, / 

SameLayer j 



Supports (a;,?/) 



A 



^Stable V \ {x}, 



Contacts (a;,?/) 
Attached (a;,?/) 



A 



T0UCHES(a;, y) A SAMELAYER(a;, y) 



A 



(3r)RlGlD(a;, y, r) V REVOLUTE(a;, y, r) 



Figure 9: Definition of the primitive event types used by Leonard. 



a variant of / where all objects except for y and those rigidly attached, directly or indirectly, to y 
are grounded. 
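
The counterfactual test for support described above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the representation of rigid joints as unordered pairs, and the toy `toy_stable` predicate (which knows only a hypothetical `RESTS_ON` relation) are not from the paper, whose stability analysis is a separate subroutine.

```python
def rigid_closure(y, rigid_pairs):
    """Objects rigidly attached to y, directly or indirectly (reflexive, transitive)."""
    seen, frontier = {y}, [y]
    while frontier:
        u = frontier.pop()
        for a, b in rigid_pairs:
            # Assumption: a rigid joint is treated as an undirected pair.
            for v in ((b,) if a == u else (a,) if b == u else ()):
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return seen

def supports(x, y, objects, grounded, rigid_pairs, stable):
    """x supports y if y is not grounded and the scene without x is unstable
    once everything except y (and its rigid attachments) is grounded."""
    if y in grounded:
        return False
    free = rigid_closure(y, rigid_pairs)
    remaining = set(objects) - {x}
    counterfactual_grounded = set(grounded) | {z for z in objects if z not in free}
    return not stable(remaining, counterfactual_grounded, rigid_pairs)

# Hypothetical stand-in for stability analysis: an object is held up if it is
# grounded or rests on a present object that is itself held up.
RESTS_ON = {"red": "green", "green": "table"}

def toy_stable(objs, grounded, rigid_pairs):
    def held(o):
        if o in grounded:
            return True
        below = RESTS_ON.get(o)
        return below in objs and held(below)
    return all(held(o) for o in objs)

OBJECTS = {"table", "green", "red"}
```

In this toy scene the green block supports the red block, while the table does not, because the green block is grounded in the counterfactual model.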

Figure 10 shows the compound event-type definitions currently used by Leonard.
PICKUP(x, y, z) denotes an event type where x picks y up off of z. It is specified as a sequence
of three intervals, where x is not attached to and does not support y in the first interval but is
attached to and does support y in the third interval. Additionally, z supports y in the first interval
but does not support y in the third interval. Furthermore, several conditions must hold in both the
first and third intervals: x must be unsupported, y must not support either x or z, x and z must
not support each other, and y must not be attached to z. During the second interval, intermediate
between the first and third intervals, either x is attached to y or y is attached to z.⁴ Additionally,
several conditions must hold throughout the entire event: x, y, and z must be distinct and y must
be supported. PUTDOWN(x, y, z) denotes an event type where x puts y down on z. It is specified
in a fashion that is similar to PICKUP(x, y, z) but where the three subevents occur in reverse order.
STACK(w, x, y, z) denotes an event type where w puts x down on y which is resting on z. It is
specified as PUTDOWN(w, x, y), where z supports but is not attached to y and z is distinct from w, x,
and y. UNSTACK(w, x, y, z) denotes an event type where w picks x up off of y which is resting on z.
It is specified as PICKUP(w, x, y), where z supports but is not attached to y and z is distinct from w,
x, and y. MOVE(w, x, y, z) denotes an event type where w picks x up off of y and puts it down
on z which is distinct from y. ASSEMBLE(w, x, y, z) denotes an event type where w first puts y
down on z then sometime later stacks x on top of y. Finally, DISASSEMBLE(w, x, y, z) denotes
an event type where w first unstacks x from on top of y (which is resting on z) and then sometime
later picks y up off of z. Figure 1 shows sample movies depicting occurrences of the event
types PICKUP(x, y, z) and PUTDOWN(x, y, z). Figures 11 through 15 show sample movies
depicting occurrences of the event types STACK(w, x, y, z), UNSTACK(w, x, y, z), MOVE(w, x, y, z),
ASSEMBLE(w, x, y, z), and DISASSEMBLE(w, x, y, z) respectively.

4. Originally, a two-interval definition was used, consisting of only the first and third intervals. Such a definition better
reflects human intuition. This requires that x be unsupported, y not support either x or z, x and z not support each
other, and y not be attached to z throughout the event. Unfortunately, however, the model-reconstruction process
has some quirks. When the hand grasps the patient while the patient is still resting on the source it produces a most-
preferred model where all three objects are attached and collectively supported by one of them being grounded. While
such a model is consistent with the theory and is minimal, it does not match human intuition. Pending improvements
in the model-reconstruction process to better reflect human intuition, the compound event-type definition for pick
up was modified to reflect the force-dynamic interpretations produced by the current model-reconstruction process.
Fortunately, the current model-reconstruction process is robust in reproducing this counterintuitive interpretation.

Nominally, all atomic event-logic expressions are primitive event types. However, we allow 
giving a name to a compound event-logic expression and using this name in another event-logic 
expression as shorthand for the named expression with appropriate parameter substitution. This
is simply a macro-expansion process and, as such, no recursion is allowed. This feature is used in 
Figure 10 to define Unstack, Move, and Disassemble in terms of PickUp; Stack, Move, 
and Assemble in terms of PutDown; Assemble in terms of Stack, which is itself defined in 
terms of PutDown; and Disassemble in terms of Unstack, which is itself defined in terms of 
PickUp. 
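
The macro-expansion process can be illustrated with a short Python sketch. Everything here is hypothetical: expressions are encoded as nested tuples, the two entries in `TOY_DEFS` are drastically simplified stand-ins for the real definitions in Figure 10, and the names `substitute` and `expand` are chosen for this example only.

```python
def substitute(expr, env):
    """Replace parameter names in expr by the arguments bound in env."""
    if isinstance(expr, str):
        return env.get(expr, expr)
    head, *args = expr
    return (head, *(substitute(a, env) for a in args))

def expand(expr, defs, active=()):
    """Expand named compound event types; reject recursive definitions."""
    if isinstance(expr, str):
        return expr
    head, *args = expr
    args = [expand(a, defs, active) for a in args]
    if head not in defs:
        return (head, *args)          # primitive event type or connective
    if head in active:
        raise ValueError(f"recursive definition of {head}")
    params, body = defs[head]
    env = dict(zip(params, args))
    return expand(substitute(body, env), defs, active + (head,))

# Toy lexicon (NOT the definitions of Figure 10): name -> (parameters, body).
TOY_DEFS = {
    "PutDown": (("x", "y", "z"),
                ("seq", ("Attached", "x", "y"), ("Supports", "z", "y"))),
    "Stack": (("w", "x", "y", "z"),
              ("and", ("PutDown", "w", "x", "y"), ("Supports", "z", "y"))),
}

expanded = expand(("Stack", "hand", "red", "green", "table"), TOY_DEFS)
```

Because a name may not reappear while it is being expanded, expansion always terminates with an expression over primitive event types only.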

The overall goal of the event-classification component is to infer all occurrences of a given set 
of compound event types from a given set of primitive event occurrences. The model-reconstruction 
component combined with the primitive event-type definitions given in Figure 9 produces a set of 
primitive event occurrences for a given scene sequence. Figure 10 lists parameterized compound 
event types. These are instantiated for all tuples of objects in the scene sequence to yield ground 
compound event-logic expressions. The event-classification component infers all occurrences of 
these compound event types that follow from the set of primitive event occurrences. Let us define
$\mathcal{E}(M, \Phi)$ to be $\{\mathbf{i} \mid M \models \Phi@\mathbf{i}\}$. The model-reconstruction component combined with the primitive
event-type definitions given in Figure 9 produces $M$. Instantiating the parameterized compound
event types from Figure 10 for all object tuples yields a set of event-logic expressions. The
event-classification component computes $\mathcal{E}(M, \Phi)$ for every $\Phi$ in this set.

In principle, $\mathcal{E}(M, \Phi)$ could be implemented as a straightforward application of the formal
semantics for event logic as specified above. There is a difficulty in doing so, however. The primitive
event types have the property that they are liquid. Liquid events have the following two properties.
First, if they are true during an interval $\mathbf{i}$, then they are also true during any subinterval of $\mathbf{i}$. Second,
if they are true during two overlapping intervals $\mathbf{i}$ and $\mathbf{j}$, then they are also true during $\mathrm{SPAN}(\mathbf{i}, \mathbf{j})$ and
any subinterval of $\mathrm{SPAN}(\mathbf{i}, \mathbf{j})$. For example, if an object is supported during [1, 10], then it also is
supported during [2, 5], [3, 8], and all other subintervals of [1, 10]. Similarly, if an object is supported
during [1, 5] and [4, 10], then it also is supported during [1, 10] and all of its subintervals. Shoham
(1987) introduced the notion of liquidity and Vendler (1967), Dowty (1979), Verkuyl (1989), and
Krifka (1992) have observed that many event types have this property. Because the primitive event
types are liquid, they will hold over an infinite number of subintervals. This renders the formal
semantics inappropriate for a computational implementation. Even if one limits oneself to intervals
with integral endpoints, the primitive event types will hold over quadratically many subintervals of
the scene sequence. Furthermore, a straightforward computational implementation of the formal
semantics would be inefficient, because it requires quantifying over subintervals to implement $\Diamond_R$ and
quantifying over pairs of subintervals to implement $\wedge_R$. The central result of this paper is a novel
representation, called spanning intervals, that allows an efficient representation of the infinite sets
of subintervals over which liquid event types hold along with an efficient inference procedure that
operates on that representation. This representation, and the inference procedure that implements
$\mathcal{E}(M, \Phi)$, are presented in the next section.
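
To make the cost of liquidity concrete, the following Python sketch enumerates the closed subintervals with integral endpoints of a given interval and merges overlapping occurrences of a liquid event. It assumes closed intervals encoded as `(q, r)` pairs of integers; the function names are illustrative and not part of the system described here.

```python
def integer_subintervals(lo, hi):
    """All closed subintervals [q, r] of [lo, hi] with integral endpoints."""
    return [(q, r) for q in range(lo, hi + 1) for r in range(q, hi + 1)]

def merge_overlapping(intervals):
    """Second liquidity property: overlapping occurrences extend to their span."""
    merged = []
    for q, r in sorted(intervals):
        if merged and q <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], r))
        else:
            merged.append((q, r))
    return merged

subs = integer_subintervals(1, 10)
count = len(subs)   # n(n+1)/2 subintervals for n = 10 integral time points
```

A single occurrence over [1, 10] already expands to 55 explicit subintervals, which is the quadratic growth that an explicit enumeration would have to pay for every primitive event occurrence.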
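\begin{align*}
\mathrm{PICKUP}(x, y, z) &\triangleq
\left(\begin{array}{l}
\neg\Diamond x = y \wedge \neg\Diamond z = x \wedge \neg\Diamond z = y \wedge {} \\
\mathrm{SUPPORTED}(y) \wedge \neg\Diamond\mathrm{ATTACHED}(x, z) \wedge {} \\
\left\{\begin{array}{l}
\left[\begin{array}{l}
\neg\Diamond\mathrm{ATTACHED}(x, y) \wedge \neg\Diamond\mathrm{SUPPORTS}(x, y) \wedge {} \\
\mathrm{SUPPORTS}(z, y) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTED}(x) \wedge \neg\Diamond\mathrm{ATTACHED}(y, z) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(y, x) \wedge \neg\Diamond\mathrm{SUPPORTS}(y, z) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(x, z) \wedge \neg\Diamond\mathrm{SUPPORTS}(z, x)
\end{array}\right] ; \\
{[\mathrm{ATTACHED}(x, y) \vee \mathrm{ATTACHED}(y, z)]} ; \\
\left[\begin{array}{l}
\mathrm{ATTACHED}(x, y) \wedge \mathrm{SUPPORTS}(x, y) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(z, y) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTED}(x) \wedge \neg\Diamond\mathrm{ATTACHED}(y, z) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(y, x) \wedge \neg\Diamond\mathrm{SUPPORTS}(y, z) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(x, z) \wedge \neg\Diamond\mathrm{SUPPORTS}(z, x)
\end{array}\right]
\end{array}\right\}
\end{array}\right) \\[1ex]
\mathrm{PUTDOWN}(x, y, z) &\triangleq
\left(\begin{array}{l}
\neg\Diamond x = y \wedge \neg\Diamond z = x \wedge \neg\Diamond z = y \wedge {} \\
\mathrm{SUPPORTED}(y) \wedge \neg\Diamond\mathrm{ATTACHED}(x, z) \wedge {} \\
\left\{\begin{array}{l}
\left[\begin{array}{l}
\mathrm{ATTACHED}(x, y) \wedge \mathrm{SUPPORTS}(x, y) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(z, y) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTED}(x) \wedge \neg\Diamond\mathrm{ATTACHED}(y, z) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(y, x) \wedge \neg\Diamond\mathrm{SUPPORTS}(y, z) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(x, z) \wedge \neg\Diamond\mathrm{SUPPORTS}(z, x)
\end{array}\right] ; \\
{[\mathrm{ATTACHED}(x, y) \vee \mathrm{ATTACHED}(y, z)]} ; \\
\left[\begin{array}{l}
\neg\Diamond\mathrm{ATTACHED}(x, y) \wedge \neg\Diamond\mathrm{SUPPORTS}(x, y) \wedge {} \\
\mathrm{SUPPORTS}(z, y) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTED}(x) \wedge \neg\Diamond\mathrm{ATTACHED}(y, z) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(y, x) \wedge \neg\Diamond\mathrm{SUPPORTS}(y, z) \wedge {} \\
\neg\Diamond\mathrm{SUPPORTS}(x, z) \wedge \neg\Diamond\mathrm{SUPPORTS}(z, x)
\end{array}\right]
\end{array}\right\}
\end{array}\right) \\[1ex]
\mathrm{STACK}(w, x, y, z) &\triangleq
\left[\begin{array}{l}
\neg\Diamond z = w \wedge \neg\Diamond z = x \wedge \neg\Diamond z = y \wedge {} \\
\mathrm{PUTDOWN}(w, x, y) \wedge \mathrm{SUPPORTS}(z, y) \wedge \neg\mathrm{ATTACHED}(z, y)
\end{array}\right] \\[1ex]
\mathrm{UNSTACK}(w, x, y, z) &\triangleq
\left[\begin{array}{l}
\neg\Diamond z = w \wedge \neg\Diamond z = x \wedge \neg\Diamond z = y \wedge {} \\
\mathrm{PICKUP}(w, x, y) \wedge \mathrm{SUPPORTS}(z, y) \wedge \neg\mathrm{ATTACHED}(z, y)
\end{array}\right] \\[1ex]
\mathrm{MOVE}(w, x, y, z) &\triangleq \neg\Diamond y = z \wedge [\mathrm{PICKUP}(w, x, y) ; \mathrm{PUTDOWN}(w, x, z)] \\
\mathrm{ASSEMBLE}(w, x, y, z) &\triangleq \mathrm{PUTDOWN}(w, y, z) \wedge_{\{<\}} \mathrm{STACK}(w, x, y, z) \\
\mathrm{DISASSEMBLE}(w, x, y, z) &\triangleq \mathrm{UNSTACK}(w, x, y, z) \wedge_{\{<\}} \mathrm{PICKUP}(w, y, z)
\end{align*}

Figure 10: The lexicon of compound event types used by Leonard.

Figure 11: An image sequence depicting a stack event.

Figure 12: An image sequence depicting an unstack event.

Figure 13: An image sequence depicting a move event.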

4. An Efficient Representation and Inference Procedure for Event Logic 

One might try to implement event logic using only closed intervals of the form $[q, r]$, where $q \le r$.
Such a closed interval would represent the set $\{p \mid q \le p \le r\}$ of real numbers. With such closed
intervals, one would define Allen's relations as follows:

$$\begin{array}{c}
\vdots \\
{[q_1, r_1]} \;\mathrm{di}\; [q_2, r_2] \triangleq (q_1 < q_2) \wedge (r_1 > r_2)
\end{array}$$

Figure 14: An image sequence depicting an assemble event.

Figure 15: An image sequence depicting a disassemble event.



One difficulty with doing so is that it would be possible for more than one Allen relation to hold
between two intervals when one or both of them are instantaneous intervals, such as $[q, q]$. Both m
and s would hold between $[q, q]$ and $[q, r]$, both mi and si would hold between $[q, r]$ and $[q, q]$,
both m and fi would hold between $[q, r]$ and $[r, r]$, both mi and f would hold between $[r, r]$ and $[q, r]$,
and =, m, and mi would all hold between $[q, q]$ and itself. To create a domain where exactly one
Allen relation holds between any pair of intervals, let us consider both open and closed intervals.
Closed intervals contain their endpoints while open intervals do not. The intervals $(q, r]$, $[q, r)$, and
$(q, r)$, where $q < r$, represent the sets $\{p \mid q < p \le r\}$, $\{p \mid q \le p < r\}$, and $\{p \mid q < p < r\}$ of real
numbers respectively. The various kinds of open and closed intervals can be unified into a single
representation ${}_{\alpha}[q, r]_{\beta}$, where $\alpha$ and $\beta$ are true or false to indicate the interval being closed or open
on the left or right respectively.⁵ More specifically, ${}_{\mathrm{T}}[q, r]_{\mathrm{T}}$ denotes $[q, r]$, ${}_{\mathrm{F}}[q, r]_{\mathrm{T}}$ denotes $(q, r]$,
${}_{\mathrm{T}}[q, r]_{\mathrm{F}}$ denotes $[q, r)$, and ${}_{\mathrm{F}}[q, r]_{\mathrm{F}}$ denotes $(q, r)$. To do this, let us use $q \le_{\alpha} r$ to mean $q \le r$
when $\alpha$ is true and $q < r$ when $\alpha$ is false. Similarly, let us use $q \ge_{\alpha} r$ to mean $q \ge r$ when $\alpha$
is true and $q > r$ when $\alpha$ is false. More precisely, $q \le_{\alpha} r \triangleq (q < r) \vee [\alpha \wedge (q = r)]$ and
$q \ge_{\alpha} r \triangleq (q > r) \vee [\alpha \wedge (q = r)]$. With these, ${}_{\alpha}[q, r]_{\beta}$ represents the set $\{p \mid q \le_{\alpha} p \le_{\beta} r\}$ of real
numbers.
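
The unified interval representation, with one Boolean per endpoint, might be sketched in Python as follows. The class name `Interval`, its field names, and the helpers `le` and `ge` are assumptions made for illustration; they simply mirror the two parameterized comparisons and the set of real numbers an interval represents.

```python
from dataclasses import dataclass

def le(x, y, closed):
    """x <= y when closed is True, x < y when closed is False."""
    return x < y or (closed and x == y)

def ge(x, y, closed):
    """x >= y when closed is True, x > y when closed is False."""
    return x > y or (closed and x == y)

@dataclass(frozen=True)
class Interval:
    a: bool     # True if closed on the left
    q: float    # lower endpoint
    r: float    # upper endpoint
    b: bool     # True if closed on the right

    def contains(self, p):
        """Membership of the real number p in the set the interval represents."""
        return le(self.q, p, self.a) and le(p, self.r, self.b)

half_open = Interval(True, 1, 3, False)    # [1, 3)
other_half = Interval(False, 1, 3, True)   # (1, 3]
```

Storing openness as two flags keeps all four kinds of interval in one type, so later operations need only one code path.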



5. Throughout this paper, I use lowercase Greek letters, such as $\alpha$, $\beta$, $\gamma$, $\delta$, $\epsilon$, and $\zeta$, to denote Boolean values, lowercase
Latin letters, such as $p$, $q$, $r$, $i$, $j$, $k$, and $l$, to denote real numbers, lowercase bold Latin letters, such as $\mathbf{i}$, $\mathbf{j}$, and $\mathbf{k}$, to
denote intervals or spanning intervals, and uppercase Latin letters, such as $I$, $J$, and $K$, to denote sets of intervals or
sets of spanning intervals.
One can extend the definition of Allen's relations to both open and closed intervals as follows.
The relation $\mathbf{i}_1 = \mathbf{i}_2$ holds if the corresponding endpoints of $\mathbf{i}_1$ and $\mathbf{i}_2$ are equal and have the same
openness. The relation $\mathbf{i}_1 < \mathbf{i}_2$ holds if the right endpoint of $\mathbf{i}_1$ precedes the left endpoint of $\mathbf{i}_2$ or
if they are equal and both open. For example, $[1, 3] < [4, 5]$ and $[1, 3) < (3, 5]$, but $[1, 3] \not< (3, 5]$,
$[1, 3) \not< [3, 5]$, and $[1, 3] \not< [3, 5]$. The relation $\mathbf{i}_1 \mathrel{\mathrm{m}} \mathbf{i}_2$ holds if the right endpoint of $\mathbf{i}_1$ equals
the left endpoint of $\mathbf{i}_2$ and one of those endpoints is open while the other is closed. For example,
$[1, 3] \mathrel{\mathrm{m}} (3, 5]$ and $[1, 3) \mathrel{\mathrm{m}} [3, 5]$ but $[1, 3] \mathrel{\not\mathrm{m}} [3, 5]$ and $[1, 3) \mathrel{\not\mathrm{m}} (3, 5]$. The relation $\mathbf{i}_1 \mathrel{\mathrm{o}} \mathbf{i}_2$ holds if

• either the left endpoint of $\mathbf{i}_1$ precedes the left endpoint of $\mathbf{i}_2$ or they are equal while the former
is closed and the latter is open,

• either the left endpoint of $\mathbf{i}_2$ precedes the right endpoint of $\mathbf{i}_1$ or they are equal while both
endpoints are closed, and

• either the right endpoint of $\mathbf{i}_1$ precedes the right endpoint of $\mathbf{i}_2$ or they are equal while the
former is open and the latter is closed.

For example, $[1, 3] \mathrel{\mathrm{o}} [2, 4]$, $[1, 3] \mathrel{\mathrm{o}} (1, 4]$, $[1, 2] \mathrel{\mathrm{o}} [2, 4]$, and $[1, 4) \mathrel{\mathrm{o}} [2, 4]$, but $[1, 3] \mathrel{\not\mathrm{o}} [1, 4]$,
$[1, 2) \mathrel{\not\mathrm{o}} [2, 4]$, and $[1, 4] \mathrel{\not\mathrm{o}} [2, 4]$. The relation $\mathbf{i}_1 \mathrel{\mathrm{s}} \mathbf{i}_2$ holds if

• the left endpoints of $\mathbf{i}_1$ and $\mathbf{i}_2$ are equal and have the same openness and

• either the right endpoint of $\mathbf{i}_1$ precedes the right endpoint of $\mathbf{i}_2$ or they are equal while the
former is open and the latter is closed.

For example, $[1, 3] \mathrel{\mathrm{s}} [1, 4]$, $(1, 3] \mathrel{\mathrm{s}} (1, 4]$, and $[1, 3) \mathrel{\mathrm{s}} [1, 3]$, but $[1, 3] \mathrel{\not\mathrm{s}} (1, 4]$, $[1, 3] \mathrel{\not\mathrm{s}} [1, 3]$,
$[1, 3) \mathrel{\not\mathrm{s}} [1, 3)$, and $[1, 3] \mathrel{\not\mathrm{s}} [1, 3)$. The relation $\mathbf{i}_1 \mathrel{\mathrm{f}} \mathbf{i}_2$ holds if

• the right endpoints of $\mathbf{i}_1$ and $\mathbf{i}_2$ are equal and have the same openness and

• either the left endpoint of $\mathbf{i}_1$ follows the left endpoint of $\mathbf{i}_2$ or they are equal while the former
is open and the latter is closed.

For example, $[2, 4] \mathrel{\mathrm{f}} [1, 4]$, $[2, 4) \mathrel{\mathrm{f}} [1, 4)$, and $(2, 4] \mathrel{\mathrm{f}} [2, 4]$, but $[2, 4) \mathrel{\not\mathrm{f}} [1, 4]$, $(2, 4] \mathrel{\not\mathrm{f}} (2, 4]$,
$[2, 4] \mathrel{\not\mathrm{f}} [2, 4]$, and $[2, 4] \mathrel{\not\mathrm{f}} (2, 4]$. The relation $\mathbf{i}_1 \mathrel{\mathrm{d}} \mathbf{i}_2$ holds if

• either the left endpoint of $\mathbf{i}_1$ follows the left endpoint of $\mathbf{i}_2$ or they are equal while the former
is open and the latter is closed and

• either the right endpoint of $\mathbf{i}_1$ precedes the right endpoint of $\mathbf{i}_2$ or they are equal while the
former is open and the latter is closed.

For example, $[2, 3] \mathrel{\mathrm{d}} [1, 4]$ and $(1, 4) \mathrel{\mathrm{d}} [1, 4]$, but $[1, 4) \mathrel{\not\mathrm{d}} [1, 4]$, $(1, 4] \mathrel{\not\mathrm{d}} [1, 4]$, $(1, 4) \mathrel{\not\mathrm{d}} (1, 4]$, and
$(1, 4) \mathrel{\not\mathrm{d}} [1, 4)$. The inverse Allen relations >, mi, oi, si, fi, and di are defined analogously to the <,
m, o, s, f, and d relations respectively with the arguments reversed.
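
The prose definitions above can be transcribed into Python and checked against the worked examples. This sketch is a direct reading of the prose, not the paper's own code; the names `Interval`, `le`, `ge`, `ALLEN`, and `relations_between` are illustrative assumptions, and only nonempty intervals are meant to be passed in.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    a: bool     # closed on the left?
    q: float
    r: float
    b: bool     # closed on the right?

def le(x, y, closed):
    return x < y or (closed and x == y)

def ge(x, y, closed):
    return x > y or (closed and x == y)

def equal(i1, i2):
    return i1 == i2

def before(i1, i2):      # <
    return le(i1.r, i2.q, not i1.b and not i2.a)

def meets(i1, i2):       # m
    return i1.r == i2.q and i1.b != i2.a

def overlaps(i1, i2):    # o
    return (le(i1.q, i2.q, i1.a and not i2.a)
            and le(i2.q, i1.r, i2.a and i1.b)
            and le(i1.r, i2.r, not i1.b and i2.b))

def starts(i1, i2):      # s
    return i1.q == i2.q and i1.a == i2.a and le(i1.r, i2.r, not i1.b and i2.b)

def finishes(i1, i2):    # f
    return i1.r == i2.r and i1.b == i2.b and ge(i1.q, i2.q, not i1.a and i2.a)

def during(i1, i2):      # d
    return ge(i1.q, i2.q, not i1.a and i2.a) and le(i1.r, i2.r, not i1.b and i2.b)

def inverse(rel):
    """Inverse relation: same test with the arguments reversed."""
    return lambda i1, i2: rel(i2, i1)

ALLEN = {"=": equal, "<": before, ">": inverse(before),
         "m": meets, "mi": inverse(meets), "o": overlaps, "oi": inverse(overlaps),
         "s": starts, "si": inverse(starts), "f": finishes, "fi": inverse(finishes),
         "d": during, "di": inverse(during)}

def relations_between(i1, i2):
    return [name for name, rel in ALLEN.items() if rel(i1, i2)]
```

Enumerating all nonempty intervals over a small grid of endpoints is a cheap way to confirm that these thirteen tests are mutually exclusive and exhaustive.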

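The above definitions can be stated more precisely as follows:

\begin{align}
{}_{\alpha_1}[q_1, r_1]_{\beta_1} = {}_{\alpha_2}[q_2, r_2]_{\beta_2} &\triangleq (q_1 = q_2) \wedge (\alpha_1 = \alpha_2) \wedge (r_1 = r_2) \wedge (\beta_1 = \beta_2) \tag{1} \\
{}_{\alpha_1}[q_1, r_1]_{\beta_1} < {}_{\alpha_2}[q_2, r_2]_{\beta_2} &\triangleq r_1 \le_{(\neg\beta_1 \wedge \neg\alpha_2)} q_2 \tag{2} \\
&\;\;\vdots \notag
\end{align}

With the above definitions, exactly one Allen relation holds between any pair of intervals.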

I refer to the set of real numbers represented by an interval as its extension. Given the above
definition of interval, any interval, such as $[5, 4]$, $(5, 4]$, $[5, 4)$, or $(5, 4)$, where the upper endpoint
is less than the lower endpoint represents the empty set. Furthermore, any open interval, such
as $[5, 5)$, $(5, 5]$, or $(5, 5)$, where the upper endpoint equals the lower endpoint also represents the
empty set. To create a situation where the extension of each interval has a unique representation,
let us represent all such empty sets of real numbers as $\{\}$. Thus whenever we represent an interval
${}_{\alpha}[q, r]_{\beta}$ explicitly, it will have a nonempty extension and will satisfy the following normalization
criterion: $q \le_{(\alpha \wedge \beta)} r$.

4.1 Spanning Intervals 

When using event logic, we wish to compute and represent the set $I$ of all intervals over which some
event-logic expression $\Phi$ holds. Many event types, including all of the primitive event types used in
Leonard, are liquid (Shoham, 1987) in the sense that if some event holds of an interval then that
event holds of every subinterval of that interval. With real-valued interval endpoints, this creates the
need to compute and represent an infinite set of intervals for a liquid event. Even limiting ourselves
to integer-valued interval endpoints, a liquid event will require the computation and representation
of quadratically many intervals.

To address this problem, let us introduce the notion of spanning interval. A spanning interval $[i : j]$
represents the set of all subintervals of $[i, j]$, in other words $\{[q, r] \mid (i \le q \le j) \wedge (i \le r \le j)\}$.
Similarly $(i : j]$, $[i : j)$, and $(i : j)$ represent $\{(q, r] \mid (i \le q \le j) \wedge (i \le r \le j)\}$,
$\{[q, r) \mid (i \le q \le j) \wedge (i \le r \le j)\}$, and $\{(q, r) \mid (i \le q \le j) \wedge (i \le r \le j)\}$ respectively. We wish
to use spanning intervals to represent the set of all intervals over which the primitive event types
hold and to compute and represent the set of all intervals over which compound event types hold via
structural induction over the compound event-logic expressions. A problem arises however. Given
two liquid event types $\Phi$ and $\Psi$, the compound event type $\Phi ; \Psi$ is not liquid. If $\Phi$ holds over $[i : j)$
and $\Psi$ holds over $[j : k)$, then $\Phi ; \Psi$ might not hold over every subinterval of $[i, k)$. It holds over only
those subintervals that include $j$. For example, if $\Phi$ holds over $[1 : 10)$ and $\Psi$ holds over $[8 : 20)$
then $\Phi ; \Psi$ holds for every interval that starts between 1 and 10 and ends between 8 and 20. But
it doesn't hold for every subinterval of $[1, 20)$. For example, it doesn't hold of $[12, 20)$. I refer to
such event types as semi-liquid. Since spanning intervals are not sufficient to efficiently represent
semi-liquid events, let us extend the notion of spanning interval. A spanning interval $[[i, j], [k, l]]$
represents the set of intervals $\{[q, r] \mid (i \le q \le j) \wedge (k \le r \le l)\}$. Similarly the spanning intervals
$([i, j], [k, l]]$, $[[i, j], [k, l])$, and $([i, j], [k, l])$ represent the sets $\{(q, r] \mid (i \le q \le j) \wedge (k \le r \le l)\}$,
$\{[q, r) \mid (i \le q \le j) \wedge (k \le r \le l)\}$, and $\{(q, r) \mid (i \le q \le j) \wedge (k \le r \le l)\}$ respectively. This
extended notion of spanning interval subsumes the original notion. The spanning intervals $[i : j]$,
$(i : j]$, $[i : j)$, and $(i : j)$ can be represented as the spanning intervals $[[i, j], [i, j]]$,
$([i, j], [i, j]]$, $[[i, j], [i, j])$, and $([i, j], [i, j])$ respectively. For reasons that will become apparent in Section 4.4,
it is necessary to also allow for spanning intervals where the ranges of endpoint values are open.
In other words, we will need to consider spanning intervals like $[(i, j], [k, l]]$ to represent sets like
$\{[q, r] \mid (i < q \le j) \wedge (k \le r \le l)\}$. All told, there are six endpoints that can independently be either
open or closed, namely $q$, $r$, $i$, $j$, $k$, and $l$, yielding sixty-four kinds of spanning intervals. These can
all be unified into a single representation ${}_{\alpha}[{}_{\gamma}[i, j]_{\delta}, {}_{\epsilon}[k, l]_{\zeta}]_{\beta}$, where $\alpha$, $\beta$, $\gamma$, $\delta$, $\epsilon$, and $\zeta$ are true or
false if the endpoints $q$, $r$, $i$, $j$, $k$, and $l$ are closed or open respectively. More precisely, the spanning
interval ${}_{\alpha}[{}_{\gamma}[i, j]_{\delta}, {}_{\epsilon}[k, l]_{\zeta}]_{\beta}$ represents the set

$$\{{}_{\alpha}[q, r]_{\beta} \mid (i \le_{\gamma} q \le_{\delta} j) \wedge (k \le_{\epsilon} r \le_{\zeta} l)\} \tag{14}$$

of intervals. I refer to the set of intervals represented by a spanning interval as its extension.
Moreover, a set of spanning intervals will represent the union of the extensions of its members.
Additionally, the empty set of spanning intervals will represent the empty set of intervals. I further refer
to the set of intervals represented by a set of spanning intervals as its extension. A key result of
this paper is that if the set of all intervals over which some set of primitive event types hold can
be represented as finite sets of spanning intervals then the set of all intervals over which all event
types that are expressible as compound event-logic expressions over those primitives hold can also
be represented as finite sets of spanning intervals.

While we require that all intervals have finite endpoints, for reasons that will also become
apparent in Section 4.4, it is necessary to allow spanning intervals to have infinite endpoints, for example
$[[-\infty, j], [k, l]]$. Such spanning intervals with infinite endpoints represent sets of intervals with
finite endpoints but where the range of possible endpoints is unconstrained from above or below.
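
A spanning interval and the membership test implied by its definition can be sketched as a small Python data structure. The class and field names are hypothetical (`a`, `g`, `d`, `e`, `z`, `b` stand for the six Booleans, in the order in which they appear in the unified notation), and the example ranges are chosen for illustration, loosely following the semi-liquid example above.

```python
import math
from dataclasses import dataclass

def le(x, y, closed):
    """x <= y when closed is True, x < y otherwise."""
    return x < y or (closed and x == y)

@dataclass(frozen=True)
class Interval:
    a: bool
    q: float
    r: float
    b: bool

@dataclass(frozen=True)
class SpanningInterval:
    a: bool     # intervals closed on the left?
    g: bool     # lower bound i of the lower endpoint closed?
    i: float
    j: float
    d: bool     # upper bound j of the lower endpoint closed?
    e: bool     # lower bound k of the upper endpoint closed?
    k: float
    l: float
    z: bool     # upper bound l of the upper endpoint closed?
    b: bool     # intervals closed on the right?

    def contains(self, iv):
        """True if the interval iv lies in the extension of this spanning interval."""
        return (iv.a == self.a and iv.b == self.b
                and le(self.i, iv.q, self.g) and le(iv.q, self.j, self.d)
                and le(self.k, iv.r, self.e) and le(iv.r, self.l, self.z))

# Intervals [q, r) that start between 1 and 10 and end between 8 and 20.
semi = SpanningInterval(True, True, 1, 10, True, True, 8, 20, True, False)
# Lower endpoint unconstrained from below: an infinite bound with an open range.
unbounded = SpanningInterval(True, False, -math.inf, 5, True, True, 0, 9, True, True)
```

One record of ten fields thus stands for an infinite set of intervals, which is what makes the representation finite.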

4.2 Normalizing Spanning Intervals 

Just as we desire that the extension of every interval have a unique representation, we also desire
that the extension of every spanning interval have a unique representation. There are a number of
situations where two different spanning intervals will have the same extension. First, all spanning
intervals ${}_{\alpha}[{}_{\gamma}[i, j]_{\delta}, {}_{\epsilon}[k, l]_{\zeta}]_{\beta}$ where $i = \infty$, $j = -\infty$, $k = \infty$, or $l = -\infty$ represent the empty set of
intervals, because there are no intervals with an endpoint that is less than or equal to minus infinity
or greater than or equal to infinity. Second, if $i = -\infty$, $j = \infty$, $k = -\infty$, or $l = \infty$, the value
of $\gamma$, $\delta$, $\epsilon$, or $\zeta$ does not affect the denotation respectively, because there are no intervals with infinite
endpoints. Third, if $j > l$, $j$ can be decreased as far as $l$ without changing the denotation, because
all intervals where the upper endpoint is less than the lower endpoint equivalently denote the empty
interval. Similarly, if $k < i$, $k$ can be increased as far as $i$ without changing the denotation. Fourth,
all spanning intervals where $i > j$ or $k > l$ represent the empty set of intervals, because the range
of possible endpoints would be empty. Fifth, all spanning intervals where $i = j$ and either $\gamma$ or $\delta$ is
false (indicating an open range for the lower endpoint) represent the empty set of intervals, because
the range of possible endpoints would be empty. Similarly, all spanning intervals where $k = l$ and
either $\epsilon$ or $\zeta$ is false (indicating an open range for the upper endpoint) also represent the empty set
of intervals. Sixth, all spanning intervals where $i = l$ and either $\alpha$ or $\beta$ is false (indicating an open
interval) also represent the empty set of intervals, because the endpoints of an open interval must be
different. Seventh, if $j = l$ and $\zeta$ is false, the value of $\delta$ does not affect the denotation, because if
$j = l$ and $\zeta$ is false, the upper endpoint must be less than $l$ and the lower endpoint must be less than
or equal to $j$ which equals $l$, so the lower endpoint must be less than $j$. Similarly, if $k = i$ and $\gamma$ is
false, the value of $\epsilon$ does not affect the denotation. Eighth, if $j = l$ and either $\alpha$ or $\beta$ is false, the
value of $\delta$ does not affect the denotation, because the lower endpoint of an open interval must be
less than its upper endpoint. Similarly, if $k = i$ and either $\alpha$ or $\beta$ is false, the value of $\epsilon$ does not
affect the denotation.



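To create a situation where the extension of every spanning interval has a unique representation,
let us represent all empty sets of intervals as $\{\}$. When the values of $i$, $j$, $k$, $l$, $\alpha$, $\beta$, $\gamma$, $\delta$, $\epsilon$, or $\zeta$
can be changed without changing the denotation, we will select the tightest such values, in other
words, false values for the Boolean parameters, maximal values for the lower bounds, and minimal
values for the upper bounds. Thus whenever we represent a spanning interval ${}_{\alpha}[{}_{\gamma}[i, j]_{\delta}, {}_{\epsilon}[k, l]_{\zeta}]_{\beta}$
explicitly, it will have a nonempty extension and will satisfy the following normalization criterion:

\begin{array}{ll}
(1) & (i \ne \infty) \wedge (j \ne -\infty) \wedge (k \ne \infty) \wedge (l \ne -\infty) \wedge {} \\
(2) & [(i = -\infty) \rightarrow \neg\gamma] \wedge [(j = \infty) \rightarrow \neg\delta] \wedge [(k = -\infty) \rightarrow \neg\epsilon] \wedge [(l = \infty) \rightarrow \neg\zeta] \wedge {} \\
(3) & (j \le l) \wedge (k \ge i) \wedge {} \\
(4) & (i \le j) \wedge (k \le l) \wedge {} \\
(5) & [(i \ne j) \vee (\gamma \wedge \delta)] \wedge [(k \ne l) \vee (\epsilon \wedge \zeta)] \wedge {} \\
(6) & [(i \ne l) \vee (\alpha \wedge \beta)] \wedge {} \\
(7) & \{[(j = l) \wedge \neg\zeta] \rightarrow \neg\delta\} \wedge \{[(k = i) \wedge \neg\gamma] \rightarrow \neg\epsilon\} \wedge {} \\
(8) & \{[(j = l) \wedge (\neg\alpha \vee \neg\beta)] \rightarrow \neg\delta\} \wedge \{[(k = i) \wedge (\neg\alpha \vee \neg\beta)] \rightarrow \neg\epsilon\}
\end{array}

Criteria (1) through (8) correspond to points one through eight above.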



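A spanning interval ${}_{\alpha}[{}_{\gamma}[i, j]_{\delta}, {}_{\epsilon}[k, l]_{\zeta}]_{\beta}$ is normalized if $i$, $j$, $k$, $l$, $\alpha$, $\beta$, $\gamma$, $\delta$, $\epsilon$, and $\zeta$ cannot be
changed without changing its denotation. Given a (potentially non-normalized) spanning interval $\mathbf{i}$,
its normalization $\langle \mathbf{i} \rangle$ is the smallest set of normalized spanning intervals that represents the extension
of $\mathbf{i}$. One can compute $\langle \mathbf{i} \rangle$ as follows:

$$\langle {}_{\alpha}[{}_{\gamma}[i, j]_{\delta}, {}_{\epsilon}[k, l]_{\zeta}]_{\beta} \rangle \triangleq
\begin{cases}
\{{}_{\alpha}[{}_{\gamma'}[i, j']_{\delta'}, {}_{\epsilon'}[k', l]_{\zeta'}]_{\beta}\} & \text{when the condition below holds} \\
\{\} & \text{otherwise}
\end{cases}$$

\begin{align*}
\text{where}\quad j' &= \min(j, l) \\
k' &= \max(k, i) \\
\gamma' &= \gamma \wedge (i \ne -\infty) \\
\delta' &= \delta \wedge [\min(j, l) \ne \infty] \wedge [(j < l) \vee (\zeta \wedge \alpha \wedge \beta)] \\
\epsilon' &= \epsilon \wedge [\max(k, i) \ne -\infty] \wedge [(k > i) \vee (\gamma \wedge \beta \wedge \alpha)] \\
\zeta' &= \zeta \wedge (l \ne \infty) \\
\text{when}\quad & (i \le j') \wedge (k' \le l) \wedge {} \\
& [(i = j') \rightarrow (\gamma' \wedge \delta')] \wedge [(k' = l) \rightarrow (\epsilon' \wedge \zeta')] \wedge {} \\
& [(i = l) \rightarrow (\alpha \wedge \beta)] \wedge {} \\
& (i \ne \infty) \wedge (j' \ne -\infty) \wedge (k' \ne \infty) \wedge (l \ne -\infty)
\end{align*}

An important property of spanning intervals is that for any spanning interval $\mathbf{i}$, $\langle \mathbf{i} \rangle$ contains at most
one normalized spanning interval.⁶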
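
The normalization operator can be transcribed into Python as a sketch. The class `SpanningInterval` and its field names are assumptions made for this example (the six Booleans are `a`, `g`, `d`, `e`, `z`, `b`), and the function returns a Python set holding zero or one element to mirror the set-valued definition.

```python
import math
from dataclasses import dataclass

INF = math.inf

@dataclass(frozen=True)
class SpanningInterval:
    a: bool     # alpha
    g: bool     # gamma
    i: float
    j: float
    d: bool     # delta
    e: bool     # epsilon
    k: float
    l: float
    z: bool     # zeta
    b: bool     # beta

def normalize(s):
    """Return a set containing at most one normalized spanning interval."""
    j2 = min(s.j, s.l)                      # tighten the upper bound on the lower endpoint
    k2 = max(s.k, s.i)                      # tighten the lower bound on the upper endpoint
    g2 = s.g and s.i != -INF
    d2 = s.d and j2 != INF and (s.j < s.l or (s.z and s.a and s.b))
    e2 = s.e and k2 != -INF and (s.k > s.i or (s.g and s.b and s.a))
    z2 = s.z and s.l != INF
    nonempty = (s.i <= j2 and k2 <= s.l
                and (s.i != j2 or (g2 and d2))
                and (k2 != s.l or (e2 and z2))
                and (s.i != s.l or (s.a and s.b))
                and s.i != INF and j2 != -INF and k2 != INF and s.l != -INF)
    if not nonempty:
        return set()
    return {SpanningInterval(s.a, g2, s.i, j2, d2, e2, k2, s.l, z2, s.b)}

closed = SpanningInterval(True, True, 1, 10, True, True, 1, 10, True, True)
loose = SpanningInterval(True, True, 0, 20, True, True, 5, 10, True, True)
empty = SpanningInterval(True, True, 10, 10, True, True, 1, 1, True, True)
```

Returning a set rather than an optional value makes it easy to build larger results as unions of normalized pieces.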

4.3 Computing the Intersection of Two Normalized Spanning Intervals 

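Given two normalized spanning intervals $\mathbf{i}_1$ and $\mathbf{i}_2$, their intersection $\mathbf{i}_1 \cap \mathbf{i}_2$ is a set of
normalized spanning intervals whose extension is the intersection of the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$. One can
compute $\mathbf{i}_1 \cap \mathbf{i}_2$ as follows:

$${}_{\alpha_1}[{}_{\gamma_1}[i_1, j_1]_{\delta_1}, {}_{\epsilon_1}[k_1, l_1]_{\zeta_1}]_{\beta_1} \cap
{}_{\alpha_2}[{}_{\gamma_2}[i_2, j_2]_{\delta_2}, {}_{\epsilon_2}[k_2, l_2]_{\zeta_2}]_{\beta_2} \triangleq
\begin{cases}
\langle {}_{\alpha_1}[{}_{\gamma}[\max(i_1, i_2), \min(j_1, j_2)]_{\delta}, {}_{\epsilon}[\max(k_1, k_2), \min(l_1, l_2)]_{\zeta}]_{\beta_1} \rangle
 & \text{when } (\alpha_1 = \alpha_2) \wedge (\beta_1 = \beta_2) \\
\{\} & \text{otherwise}
\end{cases}$$

\begin{align*}
\text{where}\quad
\gamma &= \begin{cases} \gamma_1 & i_1 > i_2 \\ \gamma_1 \wedge \gamma_2 & i_1 = i_2 \\ \gamma_2 & i_1 < i_2 \end{cases}
&
\delta &= \begin{cases} \delta_1 & j_1 < j_2 \\ \delta_1 \wedge \delta_2 & j_1 = j_2 \\ \delta_2 & j_1 > j_2 \end{cases}
\\
\epsilon &= \begin{cases} \epsilon_1 & k_1 > k_2 \\ \epsilon_1 \wedge \epsilon_2 & k_1 = k_2 \\ \epsilon_2 & k_1 < k_2 \end{cases}
&
\zeta &= \begin{cases} \zeta_1 & l_1 < l_2 \\ \zeta_1 \wedge \zeta_2 & l_1 = l_2 \\ \zeta_2 & l_1 > l_2 \end{cases}
\end{align*}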

6. The reason that $\langle \mathbf{i} \rangle$ contains at most one normalized spanning interval and not exactly one normalized spanning
interval is that $\mathbf{i}$ may denote the empty set of intervals. For example, normalizing the (non-normalized) spanning interval
$[[10, 10], [1, 1]]$ yields the empty set. Many of the definitions in the coming sections compute sets of normalized
spanning intervals as unions of one or more applications of the normalization operator $\langle \cdot \rangle$. Each such application
might yield either the empty set or a set containing a single normalized spanning interval. This leads to upper, but
not lower, bounds on the size of the computed unions.

An important property of normalized spanning intervals is that for any two normalized spanning
intervals $\mathbf{i}_1$ and $\mathbf{i}_2$, $\mathbf{i}_1 \cap \mathbf{i}_2$ contains at most one normalized spanning interval.

The intuition behind the above definition is as follows. All of the intervals in the extension of
a spanning interval are of the same type, namely $[q, r]$, $(q, r]$, $[q, r)$, or $(q, r)$. The intersection of
two spanning intervals has a nonempty extension only if the two spanning intervals contain the same
type of intervals in their extension. If they do, and the sets contain intervals whose lower endpoint is
bound from below by $i_1$ and $i_2$ respectively, then the intersection will contain intervals whose lower
endpoint is bound from below by both $i_1$ and $i_2$. The resulting bound is open or closed depending
on which of the input bounds is tighter. Similarly for the upper bound on the lower endpoint and
the lower and upper bounds on the upper endpoint.
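
The bound-combination step of the intersection can be sketched in Python as below. To keep the example short it deliberately stops before the final normalization step, so it returns a single candidate spanning interval (or `None` when the interval types differ) rather than a set of normalized ones; the names `SpanningInterval`, `tighter`, and `intersect_raw` are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpanningInterval:
    a: bool     # alpha
    g: bool     # gamma
    i: float
    j: float
    d: bool     # delta
    e: bool     # epsilon
    k: float
    l: float
    z: bool     # zeta
    b: bool     # beta

def tighter(v1, f1, v2, f2, lower_bound):
    """Combine two bounds: keep the tighter value and its closedness flag;
    on a tie the bound is closed only if both inputs are closed."""
    if v1 == v2:
        return v1, f1 and f2
    first_wins = v1 > v2 if lower_bound else v1 < v2
    return (v1, f1) if first_wins else (v2, f2)

def intersect_raw(s1, s2):
    """Candidate intersection before normalization, or None if types differ."""
    if s1.a != s2.a or s1.b != s2.b:
        return None
    i, g = tighter(s1.i, s1.g, s2.i, s2.g, lower_bound=True)
    j, d = tighter(s1.j, s1.d, s2.j, s2.d, lower_bound=False)
    k, e = tighter(s1.k, s1.e, s2.k, s2.e, lower_bound=True)
    l, z = tighter(s1.l, s1.z, s2.l, s2.z, lower_bound=False)
    return SpanningInterval(s1.a, g, i, j, d, e, k, l, z, s1.b)

s1 = SpanningInterval(True, True, 1, 10, True, True, 1, 10, True, True)
s2 = SpanningInterval(True, False, 5, 20, True, True, 5, 20, True, True)
s3 = SpanningInterval(False, True, 1, 10, True, True, 1, 10, True, True)
```

A complete implementation would pass the candidate through normalization, which may tighten it further or reduce it to the empty set.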



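4.4 Computing the Complement of a Normalized Spanning Interval

Given a normalized spanning interval $\mathbf{i}$, its complement $\neg\mathbf{i}$ is a set of normalized spanning intervals
whose extension is the complement of the extension of $\mathbf{i}$. One can compute $\neg\mathbf{i}$ as follows:

$$\neg\,{}_{\alpha}[{}_{\gamma}[i, j]_{\delta}, {}_{\epsilon}[k, l]_{\zeta}]_{\beta} \triangleq
\left(\begin{array}{l}
\langle {}_{\alpha}[{}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}, {}_{\mathrm{T}}[-\infty, k]_{\neg\epsilon}]_{\beta} \rangle \cup {} \\
\langle {}_{\alpha}[{}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}, {}_{\neg\zeta}[l, \infty]_{\mathrm{T}}]_{\beta} \rangle \cup {} \\
\langle {}_{\alpha}[{}_{\mathrm{T}}[-\infty, i]_{\neg\gamma}, {}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}]_{\beta} \rangle \cup {} \\
\langle {}_{\alpha}[{}_{\neg\delta}[j, \infty]_{\mathrm{T}}, {}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}]_{\beta} \rangle \cup {} \\
\langle {}_{\neg\alpha}[{}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}, {}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}]_{\beta} \rangle \cup {} \\
\langle {}_{\alpha}[{}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}, {}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}]_{\neg\beta} \rangle \cup {} \\
\langle {}_{\neg\alpha}[{}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}, {}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}]_{\neg\beta} \rangle
\end{array}\right)$$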

An important property of normalized spanning intervals is that for any normalized spanning
interval $\mathbf{i}$, $\neg\mathbf{i}$ contains at most seven normalized spanning intervals.

The intuition behind the above definition is as follows. First note that the negation of $q \le_{\alpha} r$
is $q \ge_{\neg\alpha} r$. Next note that the extension of $\mathbf{i}$ contains intervals whose endpoints $q$ and $r$ satisfy
$(q \ge_{\gamma} i) \wedge (q \le_{\delta} j) \wedge (r \ge_{\epsilon} k) \wedge (r \le_{\zeta} l)$. Thus the extension of $\neg\mathbf{i}$ contains intervals whose
endpoints satisfy the negation of this, namely $(q \le_{\neg\gamma} i) \vee (q \ge_{\neg\delta} j) \vee (r \le_{\neg\epsilon} k) \vee (r \ge_{\neg\zeta} l)$. Such
a disjunction requires four spanning intervals, the first four in the above definition. Additionally, if
the extension of $\mathbf{i}$ contains intervals of the form $[q, r]$, the extension of $\neg\mathbf{i}$ will contain all intervals
not of the form $[q, r]$, namely $(q, r]$, $[q, r)$, and $(q, r)$. Similarly for the cases where the extension
of $\mathbf{i}$ contains intervals of the form $(q, r]$, $[q, r)$, or $(q, r)$. This accounts for the last three spanning
intervals in the above definition.

We now see why it is necessary to allow spanning intervals to have open ranges of endpoint
values as well as infinite endpoints. The complement of a spanning interval, such as $[[i, j], [k, l]]$,
with closed endpoint ranges and finite endpoints includes spanning intervals, such as
$[[-\infty, i), [-\infty, \infty]]$, with open endpoint ranges and infinite endpoints.
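
The seven-way complement can be checked by brute force on a small grid of intervals. The Python sketch below builds the seven pieces without normalizing them, so it is an un-normalized reading of the definition; `Interval`, `SpanningInterval`, and `complement_raw` are hypothetical names introduced for this example.

```python
import math
from dataclasses import dataclass

INF = math.inf

def le(x, y, closed):
    return x < y or (closed and x == y)

@dataclass(frozen=True)
class Interval:
    a: bool
    q: float
    r: float
    b: bool

@dataclass(frozen=True)
class SpanningInterval:
    a: bool
    g: bool
    i: float
    j: float
    d: bool
    e: bool
    k: float
    l: float
    z: bool
    b: bool

    def contains(self, iv):
        return (iv.a == self.a and iv.b == self.b
                and le(self.i, iv.q, self.g) and le(iv.q, self.j, self.d)
                and le(self.k, iv.r, self.e) and le(iv.r, self.l, self.z))

def complement_raw(s):
    """The seven (un-normalized) spanning intervals whose union is the complement."""
    SI, T = SpanningInterval, True
    full = (T, -INF, INF, T)                                # unconstrained endpoint range
    return [
        SI(s.a, *full, T, -INF, s.k, not s.e, s.b),         # upper endpoint below k
        SI(s.a, *full, not s.z, s.l, INF, T, s.b),          # upper endpoint above l
        SI(s.a, T, -INF, s.i, not s.g, *full, s.b),         # lower endpoint below i
        SI(s.a, not s.d, s.j, INF, T, *full, s.b),          # lower endpoint above j
        SI(not s.a, *full, *full, s.b),                     # other interval types
        SI(s.a, *full, *full, not s.b),
        SI(not s.a, *full, *full, not s.b),
    ]

s = SpanningInterval(True, True, 1, 10, True, True, 8, 20, True, True)
pieces = complement_raw(s)
```

Each interval on the grid should fall in exactly one of the two sets: the extension of the original spanning interval or the union of the seven pieces.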



4.5 Computing the Span of Two Normalized Spanning Intervals

The span of two intervals $\mathbf{i}_1$ and $\mathbf{i}_2$, denoted $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$, is the smallest interval whose extension
contains the extensions of both $\mathbf{i}_1$ and $\mathbf{i}_2$. For example, the span of $(1, 4)$ and $[2, 6]$ is $(1, 6]$.
Similarly, the span of $[3, 7)$ and $(3, 7]$ is $[3, 7]$. More generally, the lower endpoint of $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ is the
minimum of the lower endpoints of $\mathbf{i}_1$ and $\mathbf{i}_2$. The lower endpoint of $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ is open or closed
depending on whether the smaller of the lower endpoints of $\mathbf{i}_1$ and $\mathbf{i}_2$ is open or closed. Analogously,
the upper endpoint of $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ is the maximum of the upper endpoints of $\mathbf{i}_1$ and $\mathbf{i}_2$. The upper
endpoint of $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ is open or closed depending on whether the larger of the upper endpoints
of $\mathbf{i}_1$ and $\mathbf{i}_2$ is open or closed. More precisely, $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ can be computed as follows:

$$\mathrm{SPAN}({}_{\alpha_1}[q_1, r_1]_{\beta_1}, {}_{\alpha_2}[q_2, r_2]_{\beta_2}) \triangleq
{}_{\{[\alpha_1 \wedge (q_1 \le q_2)] \vee [\alpha_2 \wedge (q_1 \ge q_2)]\}}[\min(q_1, q_2), \max(r_1, r_2)]_{\{[\beta_1 \wedge (r_1 \ge r_2)] \vee [\beta_2 \wedge (r_1 \le r_2)]\}}$$

The notion of span will be used in Section 4.7.
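
The span of two intervals is short enough to state directly in Python. The sketch below follows the description above, taking each endpoint from whichever input interval reaches further and treating a tie as closed if either input is closed there; the `Interval` class and its field names are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    a: bool     # closed on the left?
    q: float
    r: float
    b: bool     # closed on the right?

def span(i1, i2):
    """Smallest interval whose extension contains the extensions of i1 and i2."""
    a = (i1.a and i1.q <= i2.q) or (i2.a and i1.q >= i2.q)
    b = (i1.b and i1.r >= i2.r) or (i2.b and i1.r <= i2.r)
    return Interval(a, min(i1.q, i2.q), max(i1.r, i2.r), b)

first = span(Interval(False, 1, 4, False), Interval(True, 2, 6, True))    # (1, 4) and [2, 6]
second = span(Interval(True, 3, 7, False), Interval(False, 3, 7, True))   # [3, 7) and (3, 7]
```

The two calls reproduce the worked examples in the text, giving (1, 6] and [3, 7] respectively.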

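Let us extend the notion of span to two sets of intervals by the following definition:

$$\mathrm{SPAN}(I_1, I_2) \triangleq \bigcup_{\mathbf{i}_1 \in I_1} \bigcup_{\mathbf{i}_2 \in I_2} \mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$$

We will want to compute the span of two sets of intervals $I_1$ and $I_2$, when both $I_1$ and $I_2$ are
represented as spanning intervals. Additionally, we will want the resulting span to be represented
as a small set of spanning intervals.

Given two normalized spanning intervals $\mathbf{i}_1$ and $\mathbf{i}_2$, their span $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ is a set of normalized
spanning intervals whose extension is the span of the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$. One can compute
$\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ as follows:

$$\mathrm{SPAN}({}_{\alpha_1}[{}_{\gamma_1}[i_1, j_1]_{\delta_1}, {}_{\epsilon_1}[k_1, l_1]_{\zeta_1}]_{\beta_1},
{}_{\alpha_2}[{}_{\gamma_2}[i_2, j_2]_{\delta_2}, {}_{\epsilon_2}[k_2, l_2]_{\zeta_2}]_{\beta_2}) \triangleq
\left(\begin{array}{l}
\langle {}_{\alpha_1}[{}_{\gamma_1}[i_1, j]_{\delta}, {}_{\epsilon}[k, l_1]_{\zeta_1}]_{\beta_1} \rangle \cup {} \\
\langle {}_{\alpha_1}[{}_{\gamma_1}[i_1, j]_{\delta}, {}_{\epsilon}[k, l_2]_{\zeta_2}]_{\beta_2} \rangle \cup {} \\
\langle {}_{\alpha_2}[{}_{\gamma_2}[i_2, j]_{\delta}, {}_{\epsilon}[k, l_1]_{\zeta_1}]_{\beta_1} \rangle \cup {} \\
\langle {}_{\alpha_2}[{}_{\gamma_2}[i_2, j]_{\delta}, {}_{\epsilon}[k, l_2]_{\zeta_2}]_{\beta_2} \rangle
\end{array}\right)$$

\begin{align*}
\text{where}\quad j &= \min(j_1, j_2) \\
k &= \max(k_1, k_2) \\
\delta &= [\delta_1 \wedge (j_1 \le j_2)] \vee [\delta_2 \wedge (j_1 \ge j_2)] \\
\epsilon &= [\epsilon_1 \wedge (k_1 \ge k_2)] \vee [\epsilon_2 \wedge (k_1 \le k_2)]
\end{align*}

An important property of normalized spanning intervals is that for any two normalized spanning
intervals $\mathbf{i}_1$ and $\mathbf{i}_2$, $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ contains at most four normalized spanning intervals. In practice,
however, fewer normalized spanning intervals are needed, often only one.

The intuition behind the above definition is as follows. Consider, first, the lower endpoint.
Suppose that the lower endpoints $q_1$ and $q_2$ of $\mathbf{i}_1$ and $\mathbf{i}_2$ are in ${}_{\gamma_1}[i_1, j_1]_{\delta_1}$ and ${}_{\gamma_2}[i_2, j_2]_{\delta_2}$ respectively.
That means that $i_1 \le_{\gamma_1} q_1 \le_{\delta_1} j_1$ and $i_2 \le_{\gamma_2} q_2 \le_{\delta_2} j_2$. The lower endpoint of $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$
will be $q_1$, when $q_1 \le q_2$, and $q_2$, when $q_1 \ge q_2$. Thus it will be $q_1$, for all $i_1 \le_{\gamma_1} q_1 \le_{\delta} \min(j_1, j_2)$,
and will be $q_2$, for all $i_2 \le_{\gamma_2} q_2 \le_{\delta} \min(j_1, j_2)$, where $\delta = \delta_1$, when $j_1 \le j_2$,
and $\delta = \delta_2$, when $j_1 \ge j_2$. Thus there will be two potential ranges for the lower endpoint of
$\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$: ${}_{\gamma_1}[i_1, \min(j_1, j_2)]_{\delta}$ and ${}_{\gamma_2}[i_2, \min(j_1, j_2)]_{\delta}$. When the lower endpoint of $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$
is taken from the former, it will be open or closed depending on whether the lower endpoint of $\mathbf{i}_1$
is open or closed. When it is taken from the latter, it will be open or closed depending on whether
the lower endpoint of $\mathbf{i}_2$ is open or closed. Thus the lower endpoint of $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$ can be either
${}_{\alpha_1}[{}_{\gamma_1}[i_1, \min(j_1, j_2)]_{\delta}$ or ${}_{\alpha_2}[{}_{\gamma_2}[i_2, \min(j_1, j_2)]_{\delta}$. Analogous reasoning can be applied to the upper
endpoints. If the upper endpoints of $\mathbf{i}_1$ and $\mathbf{i}_2$ are ${}_{\epsilon_1}[k_1, l_1]_{\zeta_1}]_{\beta_1}$ and ${}_{\epsilon_2}[k_2, l_2]_{\zeta_2}]_{\beta_2}$ respectively, then
there are two possibilities for the upper endpoint of $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$, namely ${}_{\epsilon}[\max(k_1, k_2), l_1]_{\zeta_1}]_{\beta_1}$ and
${}_{\epsilon}[\max(k_1, k_2), l_2]_{\zeta_2}]_{\beta_2}$, where $\epsilon = \epsilon_1$, when $k_1 \ge k_2$, and $\epsilon = \epsilon_2$, when $k_1 \le k_2$.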
4.6 Computing the $\mathcal{D}$ of a Normalized Spanning Interval

Given an Allen relation $r$ and a set $I$ of intervals, let $\mathcal{D}(r, I)$ denote the set $J$ of all intervals $\mathbf{j}$ such
that $\mathbf{i}\, r\, \mathbf{j}$ for some $\mathbf{i} \in I$. Given an Allen relation $r$ and a normalized spanning interval $\mathbf{i}$, let $\mathcal{D}(r, \mathbf{i})$
denote a set of normalized spanning intervals whose extension is $\mathcal{D}(r, I)$, where $I$ is the extension
of $\mathbf{i}$. One can compute $\mathcal{D}(r, \mathbf{i})$ as follows:







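$$
\begin{aligned}
\mathcal{D}(=, \mathbf{i}) &\triangleq \{\mathbf{i}\} \\[4pt]
\mathcal{D}(<, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2,\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{(\neg\beta_1 \wedge \neg\alpha_2 \wedge \epsilon_1)}[k_1, \infty]_{\mathrm{T}},\, {}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}]_{\beta_2} \rangle \\[4pt]
\mathcal{D}(>, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2,\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}},\, {}_{\mathrm{T}}[-\infty, j_1]_{(\neg\alpha_1 \wedge \neg\beta_2 \wedge \delta_1)}]_{\beta_2} \rangle \\[4pt]
\mathcal{D}(\mathrm{m}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\neg\beta_1}[{}_{\epsilon_1}[k_1, l_1]_{\zeta_1},\, {}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}}]_{\beta_2} \rangle \\[4pt]
\mathcal{D}(\mathrm{mi}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{\mathrm{T}}[-\infty, \infty]_{\mathrm{T}},\, {}_{\gamma_1}[i_1, j_1]_{\delta_1}]_{\neg\alpha_1} \rangle \\[4pt]
\mathcal{D}(\mathrm{o}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2,\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{(\alpha_1 \wedge \neg\alpha_2 \wedge \gamma_1)}[i_1, l_1]_{(\beta_1 \wedge \alpha_2 \wedge \zeta_1)},\, {}_{(\neg\beta_1 \wedge \beta_2 \wedge \epsilon_1)}[k_1, \infty]_{\mathrm{T}}]_{\beta_2} \rangle \\[4pt]
\mathcal{D}(\mathrm{oi}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2,\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{\mathrm{T}}[-\infty, j_1]_{(\neg\alpha_1 \wedge \alpha_2 \wedge \delta_1)},\, {}_{(\alpha_1 \wedge \beta_2 \wedge \gamma_1)}[i_1, l_1]_{(\beta_1 \wedge \neg\beta_2 \wedge \zeta_1)}]_{\beta_2} \rangle \\[4pt]
\mathcal{D}(\mathrm{s}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_1}[{}_{\gamma_1}[i_1, j_1]_{\delta_1},\, {}_{(\neg\beta_1 \wedge \beta_2 \wedge \epsilon_1)}[k_1, \infty]_{\mathrm{T}}]_{\beta_2} \rangle \\[4pt]
\mathcal{D}(\mathrm{si}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_1}[{}_{\gamma_1}[i_1, j_1]_{\delta_1},\, {}_{\mathrm{T}}[-\infty, l_1]_{(\beta_1 \wedge \neg\beta_2 \wedge \zeta_1)}]_{\beta_2} \rangle \\[4pt]
\mathcal{D}(\mathrm{f}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{\mathrm{T}}[-\infty, j_1]_{(\neg\alpha_1 \wedge \alpha_2 \wedge \delta_1)},\, {}_{\epsilon_1}[k_1, l_1]_{\zeta_1}]_{\beta_1} \rangle \\[4pt]
\mathcal{D}(\mathrm{fi}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{(\alpha_1 \wedge \neg\alpha_2 \wedge \gamma_1)}[i_1, \infty]_{\mathrm{T}},\, {}_{\epsilon_1}[k_1, l_1]_{\zeta_1}]_{\beta_1} \rangle \\[4pt]
\mathcal{D}(\mathrm{d}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2,\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{\mathrm{T}}[-\infty, j_1]_{(\neg\alpha_1 \wedge \alpha_2 \wedge \delta_1)},\, {}_{(\neg\beta_1 \wedge \beta_2 \wedge \epsilon_1)}[k_1, \infty]_{\mathrm{T}}]_{\beta_2} \rangle \\[4pt]
\mathcal{D}(\mathrm{di}, {}_{\alpha_1}[{}_{\gamma_1}[i_1,j_1]_{\delta_1}, {}_{\epsilon_1}[k_1,l_1]_{\zeta_1}]_{\beta_1}) &\triangleq
\bigcup_{\alpha_2,\beta_2 \in \{\mathrm{T},\mathrm{F}\}}
\langle {}_{\alpha_2}[{}_{(\alpha_1 \wedge \neg\alpha_2 \wedge \gamma_1)}[i_1, \infty]_{\mathrm{T}},\, {}_{\mathrm{T}}[-\infty, l_1]_{(\beta_1 \wedge \neg\beta_2 \wedge \zeta_1)}]_{\beta_2} \rangle
\end{aligned}
$$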



An important property of normalized spanning intervals is that for any normalized spanning interval
$\mathbf{i}$, $\mathcal{D}(r, \mathbf{i})$ contains at most 1, 4, 4, 2, 2, 4, 4, 2, 2, 2, 2, 4, or 4 normalized spanning intervals
when $r$ is =, <, >, m, mi, o, oi, s, si, f, fi, d, or di respectively. In practice, however, fewer normalized
spanning intervals are needed, often only one.

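The intuition behind the above definition is as follows. Let us handle each of the cases separately.

$r = {<}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1 < \mathbf{i}'_2$. From (2)
we get $r_1 \le_{(\neg\beta_1 \wedge \neg\alpha_2)} q_2$. Furthermore, from (14) we get $k_1 \le_{\epsilon_1} r_1$. Combining these we
get $k_1 \le_{(\neg\beta_1 \wedge \neg\alpha_2 \wedge \epsilon_1)} q_2$. In this case, both $\alpha_2$ and $\beta_2$ are free indicating that either endpoint
of $\mathbf{i}'_2$ can be open or closed.

$r = {>}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1 > \mathbf{i}'_2$. From (3)
we get $q_1 \ge_{(\neg\alpha_1 \wedge \neg\beta_2)} r_2$. Furthermore, from (14) we get $q_1 \le_{\delta_1} j_1$. Combining these we get
$r_2 \le_{(\neg\alpha_1 \wedge \neg\beta_2 \wedge \delta_1)} j_1$. In this case, both $\alpha_2$ and $\beta_2$ are free indicating that either endpoint of $\mathbf{i}'_2$
can be open or closed.

$r = \mathrm{m}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{m}\ \mathbf{i}'_2$.
From (4) we get $r_1 = q_2$ and $\beta_1 \ne \alpha_2$. Furthermore, from (14) we get $k_1 \le_{\epsilon_1} r_1 \le_{\zeta_1} l_1$.
Combining these we get $k_1 \le_{\epsilon_1} q_2 \le_{\zeta_1} l_1$ and $\beta_1 \ne \alpha_2$. In this case, only $\beta_2$ is free indicating
that the upper endpoint of $\mathbf{i}'_2$ can be open or closed.

$r = \mathrm{mi}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{mi}\ \mathbf{i}'_2$.
From (5) we get $q_1 = r_2$ and $\alpha_1 \ne \beta_2$. Furthermore, from (14) we get $i_1 \le_{\gamma_1} q_1 \le_{\delta_1} j_1$.
Combining these we get $i_1 \le_{\gamma_1} r_2 \le_{\delta_1} j_1$ and $\alpha_1 \ne \beta_2$. In this case, only $\alpha_2$ is free indicating
that the lower endpoint of $\mathbf{i}'_2$ can be open or closed.

$r = \mathrm{o}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{o}\ \mathbf{i}'_2$. From (6)
we get $q_1 \le_{(\alpha_1 \wedge \neg\alpha_2)} q_2 \le_{(\beta_1 \wedge \alpha_2)} r_1 \le_{(\neg\beta_1 \wedge \beta_2)} r_2$. Furthermore, from (14) we get $i_1 \le_{\gamma_1} q_1$
and $k_1 \le_{\epsilon_1} r_1 \le_{\zeta_1} l_1$. Combining these we get $i_1 \le_{(\alpha_1 \wedge \neg\alpha_2 \wedge \gamma_1)} q_2 \le_{(\beta_1 \wedge \alpha_2 \wedge \zeta_1)} l_1$ and
$k_1 \le_{(\neg\beta_1 \wedge \beta_2 \wedge \epsilon_1)} r_2$. In this case, both $\alpha_2$ and $\beta_2$ are free indicating that either endpoint of $\mathbf{i}'_2$
can be open or closed.

$r = \mathrm{oi}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{oi}\ \mathbf{i}'_2$.
From (7) we get $q_2 \le_{(\neg\alpha_1 \wedge \alpha_2)} q_1 \le_{(\alpha_1 \wedge \beta_2)} r_2 \le_{(\beta_1 \wedge \neg\beta_2)} r_1$. Furthermore, from (14) we get
$r_1 \le_{\zeta_1} l_1$ and $i_1 \le_{\gamma_1} q_1 \le_{\delta_1} j_1$. Combining these we get $i_1 \le_{(\alpha_1 \wedge \beta_2 \wedge \gamma_1)} r_2 \le_{(\beta_1 \wedge \neg\beta_2 \wedge \zeta_1)} l_1$
and $q_2 \le_{(\neg\alpha_1 \wedge \alpha_2 \wedge \delta_1)} j_1$. In this case, both $\alpha_2$ and $\beta_2$ are free indicating that either endpoint
of $\mathbf{i}'_2$ can be open or closed.

$r = \mathrm{s}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{s}\ \mathbf{i}'_2$. From (8)
we get $q_1 = q_2$, $\alpha_1 = \alpha_2$, and $r_1 \le_{(\neg\beta_1 \wedge \beta_2)} r_2$. Furthermore, from (14) we get
$i_1 \le_{\gamma_1} q_1 \le_{\delta_1} j_1$ and $k_1 \le_{\epsilon_1} r_1$. Combining these we get $\alpha_1 = \alpha_2$, $i_1 \le_{\gamma_1} q_2 \le_{\delta_1} j_1$, and
$k_1 \le_{(\neg\beta_1 \wedge \beta_2 \wedge \epsilon_1)} r_2$. In this case, only $\beta_2$ is free indicating that the upper endpoint of $\mathbf{i}'_2$ can
be open or closed.

$r = \mathrm{si}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{si}\ \mathbf{i}'_2$.
From (9) we get $q_1 = q_2$, $\alpha_1 = \alpha_2$, and $r_1 \ge_{(\beta_1 \wedge \neg\beta_2)} r_2$. Furthermore, from (14) we get
$i_1 \le_{\gamma_1} q_1 \le_{\delta_1} j_1$ and $r_1 \le_{\zeta_1} l_1$. Combining these we get $\alpha_1 = \alpha_2$, $i_1 \le_{\gamma_1} q_2 \le_{\delta_1} j_1$, and
$r_2 \le_{(\beta_1 \wedge \neg\beta_2 \wedge \zeta_1)} l_1$. In this case, only $\beta_2$ is free indicating that the upper endpoint of $\mathbf{i}'_2$ can
be open or closed.

$r = \mathrm{f}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{f}\ \mathbf{i}'_2$. From (10)
we get $q_1 \ge_{(\neg\alpha_1 \wedge \alpha_2)} q_2$, $r_1 = r_2$, and $\beta_1 = \beta_2$. Furthermore, from (14) we get $k_1 \le_{\epsilon_1} r_1 \le_{\zeta_1} l_1$
and $q_1 \le_{\delta_1} j_1$. Combining these we get $\beta_1 = \beta_2$, $k_1 \le_{\epsilon_1} r_2 \le_{\zeta_1} l_1$, and $q_2 \le_{(\neg\alpha_1 \wedge \alpha_2 \wedge \delta_1)} j_1$.
In this case, only $\alpha_2$ is free indicating that the lower endpoint of $\mathbf{i}'_2$ can be open or closed.

$r = \mathrm{fi}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{fi}\ \mathbf{i}'_2$.
From (11) we get $q_1 \le_{(\alpha_1 \wedge \neg\alpha_2)} q_2$, $r_1 = r_2$, and $\beta_1 = \beta_2$. Furthermore, from (14) we
get $k_1 \le_{\epsilon_1} r_1 \le_{\zeta_1} l_1$ and $i_1 \le_{\gamma_1} q_1$. Combining these we get $\beta_1 = \beta_2$, $k_1 \le_{\epsilon_1} r_2 \le_{\zeta_1} l_1$, and
$i_1 \le_{(\alpha_1 \wedge \neg\alpha_2 \wedge \gamma_1)} q_2$. In this case, only $\alpha_2$ is free indicating that the lower endpoint of $\mathbf{i}'_2$ can
be open or closed.

$r = \mathrm{d}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{d}\ \mathbf{i}'_2$.
From (12) we get $q_1 \ge_{(\neg\alpha_1 \wedge \alpha_2)} q_2$ and $r_1 \le_{(\neg\beta_1 \wedge \beta_2)} r_2$. Furthermore, from (14) we get
$q_1 \le_{\delta_1} j_1$ and $k_1 \le_{\epsilon_1} r_1$. Combining these we get $q_2 \le_{(\neg\alpha_1 \wedge \alpha_2 \wedge \delta_1)} j_1$ and $k_1 \le_{(\neg\beta_1 \wedge \beta_2 \wedge \epsilon_1)} r_2$.
In this case, both $\alpha_2$ and $\beta_2$ are free indicating that either endpoint of $\mathbf{i}'_2$ can be open or
closed.

$r = \mathrm{di}$  For any intervals $\mathbf{i}'_1$ and $\mathbf{i}'_2$ in the extensions of $\mathbf{i}_1$ and $\mathbf{i}_2$ respectively we want $\mathbf{i}'_1\ \mathrm{di}\ \mathbf{i}'_2$.
From (13) we get $q_1 \le_{(\alpha_1 \wedge \neg\alpha_2)} q_2$ and $r_1 \ge_{(\beta_1 \wedge \neg\beta_2)} r_2$. Furthermore, from (14) we get
$i_1 \le_{\gamma_1} q_1$ and $r_1 \le_{\zeta_1} l_1$. Combining these we get $i_1 \le_{(\alpha_1 \wedge \neg\alpha_2 \wedge \gamma_1)} q_2$ and $r_2 \le_{(\beta_1 \wedge \neg\beta_2 \wedge \zeta_1)} l_1$.
In this case, both $\alpha_2$ and $\beta_2$ are free indicating that either endpoint of $\mathbf{i}'_2$ can be open or closed.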
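The set-level definition of $\mathcal{D}(r, I)$ can be illustrated by brute force over explicit, finite sets of intervals. The sketch below is only an illustration and does not use spanning intervals: the `Interval` class, the `le` helper (standing in for the subscripted $\le$), the relation predicates, and `d_set` are all hypothetical names, and the predicates are transcribed from the endpoint inequalities quoted in the case analysis above, with each inverse relation obtained by swapping arguments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set


@dataclass(frozen=True)
class Interval:
    q: float   # lower endpoint
    r: float   # upper endpoint
    lc: bool   # alpha: True if the lower endpoint is closed
    rc: bool   # beta: True if the upper endpoint is closed


def le(x: float, y: float, allow_equal: bool) -> bool:
    """x <=_flag y: non-strict when the flag is true, strict otherwise."""
    return x <= y if allow_equal else x < y


def equals(a: Interval, b: Interval) -> bool:
    return a == b

def before(a: Interval, b: Interval) -> bool:
    return le(a.r, b.q, (not a.rc) and (not b.lc))

def meets(a: Interval, b: Interval) -> bool:
    return a.r == b.q and a.rc != b.lc

def overlaps(a: Interval, b: Interval) -> bool:
    return (le(a.q, b.q, a.lc and (not b.lc))
            and le(b.q, a.r, a.rc and b.lc)
            and le(a.r, b.r, (not a.rc) and b.rc))

def starts(a: Interval, b: Interval) -> bool:
    return a.q == b.q and a.lc == b.lc and le(a.r, b.r, (not a.rc) and b.rc)

def finishes(a: Interval, b: Interval) -> bool:
    return le(b.q, a.q, (not a.lc) and b.lc) and a.r == b.r and a.rc == b.rc

def during(a: Interval, b: Interval) -> bool:
    return (le(b.q, a.q, (not a.lc) and b.lc)
            and le(a.r, b.r, (not a.rc) and b.rc))

def inverse(rel: Callable[[Interval, Interval], bool]) -> Callable[[Interval, Interval], bool]:
    """The same relation with its arguments reversed."""
    return lambda a, b: rel(b, a)


RELATIONS: Dict[str, Callable[[Interval, Interval], bool]] = {
    "=": equals, "<": before, ">": inverse(before),
    "m": meets, "mi": inverse(meets), "o": overlaps, "oi": inverse(overlaps),
    "s": starts, "si": inverse(starts), "f": finishes, "fi": inverse(finishes),
    "d": during, "di": inverse(during),
}


def d_set(r: str, intervals: Iterable[Interval], universe: Iterable[Interval]) -> Set[Interval]:
    """All j in a finite universe such that i r j for some i in intervals."""
    rel = RELATIONS[r]
    sources = list(intervals)
    return {j for j in universe if any(rel(i, j) for i in sources)}
```

With $\mathbf{i} = [0, 2)$ and a universe containing $[2,5]$, $(2,5]$, and $[3,5]$, this sketch puts only $[2,5]$ in the m case (the touching endpoints must differ in openness) and puts $(2,5]$ and $[3,5]$ in the < case.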

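4.7 Computing the $\mathcal{I}$ of Two Normalized Spanning Intervals

Given an Allen relation $r$ and two sets $I$ and $J$ of intervals, let $\mathcal{I}(I, r, J)$ denote the set $K$ of all
intervals $\mathbf{k}$ such that $\mathbf{k} = \mathrm{SPAN}(\mathbf{i}, \mathbf{j})$ for some $\mathbf{i} \in I$ and $\mathbf{j} \in J$, where $\mathbf{i}\, r\, \mathbf{j}$. Given an Allen relation $r$
and two normalized spanning intervals $\mathbf{i}$ and $\mathbf{j}$, let $\mathcal{I}(\mathbf{i}, r, \mathbf{j})$ denote a set of normalized spanning
intervals whose extension is $\mathcal{I}(I, r, J)$, where $I$ and $J$ are the extensions of $\mathbf{i}$ and $\mathbf{j}$ respectively.
One can compute $\mathcal{I}(\mathbf{i}, r, \mathbf{j})$ as follows:

$$
\mathcal{I}(\mathbf{i}, r, \mathbf{j}) \triangleq
\bigcup_{\mathbf{i}' \in \mathcal{D}(r^{-1}, \mathbf{j})} \;
\bigcup_{\mathbf{i}'' \in \mathbf{i}' \cap \mathbf{i}} \;
\bigcup_{\mathbf{j}' \in \mathcal{D}(r, \mathbf{i})} \;
\bigcup_{\mathbf{j}'' \in \mathbf{j}' \cap \mathbf{j}}
\mathrm{SPAN}(\mathbf{i}'', \mathbf{j}'')
$$

Here, $r^{-1}$ denotes the inverse relation corresponding to $r$, i.e. the same relation as $r$ but with the
arguments reversed. It is easy to see that $|\mathcal{I}(\cdot, r, \cdot)| \le 4|\mathcal{D}(r, \cdot)|^2$. Thus an important property of
normalized spanning intervals is that for any two normalized spanning intervals $\mathbf{i}$ and $\mathbf{j}$, $\mathcal{I}(\mathbf{i}, r, \mathbf{j})$
contains at most 4, 64, 64, 16, 16, 64, 64, 16, 16, 16, 16, 64, or 64 normalized spanning intervals,
when $r$ is =, <, >, m, mi, o, oi, s, si, f, fi, d, or di respectively. While simple combinatorial
enumeration yields the above weak bounds on the number of normalized spanning intervals needed
to represent $\mathcal{I}(\mathbf{i}, r, \mathbf{j})$, in practice, far fewer normalized spanning intervals are needed, in most cases
only one.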
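The arithmetic behind these worst-case counts can be checked directly. The short sketch below assumes only the per-relation bounds on $\mathcal{D}$ quoted in Section 4.6 and the factor of four from SPAN; the names `D_BOUND` and `i_bound` are hypothetical.

```python
# Worst-case sizes of D(r, i) for each Allen relation, as stated in Section 4.6.
D_BOUND = {
    "=": 1, "<": 4, ">": 4, "m": 2, "mi": 2, "o": 4, "oi": 4,
    "s": 2, "si": 2, "f": 2, "fi": 2, "d": 4, "di": 4,
}


def i_bound(r: str) -> int:
    """Bound on |I(i, r, j)|: four spans for each pair drawn from two D sets."""
    return 4 * D_BOUND[r] ** 2


ORDER = ["=", "<", ">", "m", "mi", "o", "oi", "s", "si", "f", "fi", "d", "di"]
BOUNDS = [i_bound(r) for r in ORDER]
```

Evaluating `BOUNDS` reproduces the thirteen figures listed in the text, which confirms that they follow from combinatorial enumeration alone rather than from any property of particular inputs.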

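The intuition behind the above definition is as follows. Let $I$ and $J$ be the extensions of $\mathbf{i}$ and $\mathbf{j}$
respectively. The extension of the set of all $\mathbf{i}'$ is the set of all intervals $\mathbf{i}$ such that $\mathbf{i}\, r\, \mathbf{j}$ for some $\mathbf{j}$ in $J$.
Furthermore, the extension of the set of all $\mathbf{i}''$ is the set of all intervals $\mathbf{i}$ in $I$ such that $\mathbf{i}\, r\, \mathbf{j}$ for some $\mathbf{j}$
in $J$. Similarly, the extension of the set of all $\mathbf{j}'$ is the set of all intervals $\mathbf{j}$ such that $\mathbf{i}\, r\, \mathbf{j}$ for some $\mathbf{i}$
in $I$. Analogously, the extension of the set of all $\mathbf{j}''$ is the set of all intervals $\mathbf{j}$ in $J$ such that $\mathbf{i}\, r\, \mathbf{j}$ for
some $\mathbf{i}$ in $I$. Thus the extension of the set of all $\mathrm{SPAN}(\mathbf{i}'', \mathbf{j}'')$ is the set of all intervals $\mathbf{k}$ such that
$\mathbf{k} = \mathrm{SPAN}(\mathbf{i}, \mathbf{j})$ where $\mathbf{i}$ is in $I$, $\mathbf{j}$ is in $J$, and $\mathbf{i}\, r\, \mathbf{j}$.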
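The set-level meaning of $\mathcal{I}(I, r, J)$ can be sketched by brute force over small, explicit sets of intervals. This is an illustration of the definition only, not of the spanning-interval procedure: `Interval`, `before`, `meets`, `span`, and `i_set` are hypothetical names, and only two Allen relations are included to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass(frozen=True)
class Interval:
    q: float   # lower endpoint
    r: float   # upper endpoint
    lc: bool   # True if the lower endpoint is closed
    rc: bool   # True if the upper endpoint is closed


def before(a: Interval, b: Interval) -> bool:
    # Equality of the touching endpoints is allowed only when both are open.
    both_open = (not a.rc) and (not b.lc)
    return a.r <= b.q if both_open else a.r < b.q


def meets(a: Interval, b: Interval) -> bool:
    # The touching endpoints coincide and exactly one of them is closed.
    return a.r == b.q and a.rc != b.lc


def span(a: Interval, b: Interval) -> Interval:
    lc = (a.lc and a.q <= b.q) or (b.lc and a.q >= b.q)
    rc = (a.rc and a.r >= b.r) or (b.rc and a.r <= b.r)
    return Interval(min(a.q, b.q), max(a.r, b.r), lc, rc)


def i_set(rel: Callable[[Interval, Interval], bool],
          I: Iterable[Interval], J: Iterable[Interval]) -> Set[Interval]:
    """All spans SPAN(i, j) with i in I, j in J, and i rel j."""
    targets = list(J)
    return {span(i, j) for i in I for j in targets if rel(i, j)}
```

For example, with $I = \{[0,1], [6,7]\}$ and $J = \{[3,4]\}$, only $[0,1]$ is before $[3,4]$, so the result under this sketch is the single interval $[0,4]$.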
4.8 An Efficient Inference Procedure for Event Logic

Given the above procedures for computing $\langle\mathbf{i}\rangle$, $\mathbf{i}_1 \cap \mathbf{i}_2$, $\neg\mathbf{i}$, $\mathrm{SPAN}(\mathbf{i}_1, \mathbf{i}_2)$, $\mathcal{D}(r, \mathbf{i})$, and $\mathcal{I}(\mathbf{i}, r, \mathbf{j})$, one
can now define a procedure for computing $\mathcal{E}(M, \Phi)$. This procedure takes a model $M$ along with
an event-logic expression $\Phi$ and computes a set of normalized spanning intervals that represents the
set $I$ of intervals $\mathbf{i}$ for which $\Phi@\mathbf{i}$ is true. The model $M$ is a set of atomic event-occurrence formulae
of the form $p(c_1, \ldots, c_n)@\mathbf{i}$, where $p(c_1, \ldots, c_n)$ is a ground primitive event-logic expression and $\mathbf{i}$
is a normalized spanning interval. A model entry $p(c_1, \ldots, c_n)@\mathbf{i}$ indicates that the primitive event
$p(c_1, \ldots, c_n)$ occurred during all intervals in the extension of $\mathbf{i}$.

$$
\begin{aligned}
\mathcal{E}(M, p(c_1, \ldots, c_n)) &\triangleq \{\mathbf{i} \mid p(c_1, \ldots, c_n)@\mathbf{i} \in M\} \\
\mathcal{E}(M, \Phi \vee \Psi) &\triangleq \mathcal{E}(M, \Phi) \cup \mathcal{E}(M, \Psi) \\
\mathcal{E}(M, \neg\Phi) &\triangleq \bigcup_{\mathbf{i}'_1 \in \neg\mathbf{i}_1} \cdots \bigcup_{\mathbf{i}'_n \in \neg\mathbf{i}_n} \mathbf{i}'_1 \cap \cdots \cap \mathbf{i}'_n
\quad \text{where } \mathcal{E}(M, \Phi) = \{\mathbf{i}_1, \ldots, \mathbf{i}_n\} \\
\mathcal{E}(M, \Phi \wedge_R \Psi) &\triangleq \bigcup_{\mathbf{i} \in \mathcal{E}(M, \Phi)} \; \bigcup_{\mathbf{j} \in \mathcal{E}(M, \Psi)} \; \bigcup_{r \in R} \mathcal{I}(\mathbf{i}, r, \mathbf{j}) \\
\mathcal{E}(M, \Diamond_R \Phi) &\triangleq \bigcup_{\mathbf{i} \in \mathcal{E}(M, \Phi)} \; \bigcup_{r \in R} \mathcal{D}(r, \mathbf{i})
\end{aligned}
$$

The procedure performs structural induction on $\Phi$. It computes a set of normalized spanning
intervals to represent the occurrence of each atomic event-logic expression in $\Phi$ and recursively
combines the sets so computed for each child subexpression to yield the sets for each parent
subexpression. An important property of this inference procedure is that for any finite model $M$, $\mathcal{E}(M, \Phi)$,
the set $I$ of intervals $\mathbf{i}$ for which $\Phi@\mathbf{i}$ is true, can be represented by a finite set of normalized spanning
intervals. Nominally, the number of normalized spanning intervals in $\mathcal{E}(M, \Phi)$ can be exponential
in the subexpression depth of $\Phi$ because each step in the structural induction can introduce a
constant factor growth in the size of the set. However, in practice, such exponential growth does not
occur. Computing $\mathcal{E}(M, \Phi)$ for all of the event types given in Figure 10 for all of the movies that
have been tried so far has yielded sets of fewer than a dozen normalized spanning intervals.

5. Experimental Results

The techniques described in this paper have been implemented as a system called Leonard and
tested on a number of video sequences.⁷ Leonard successfully recognizes the events pick up, put
down, stack, unstack, move, assemble, and disassemble using the definitions given in Figure 10.
Figures 1 and 11 through 15 show the key frames from movies that depict these seven event types.
These movies were filmed using a Canon VC-C3 camera and a Matrox Meteor frame grabber at
320 x 240 resolution at 30fps. Figures 4 and 16 through 20 show the results of segmentation,
tracking, and model reconstruction for those key frames superimposed on the original images. Figures 5
and 21 through 25 show the results of event classification for these movies. These figures show
Leonard correctly recognizing the intended event classes for each movie.

7. The code for Leonard, the video input sequences discussed in this paper, and the full frame-by-frame output of
Leonard on those sequences is available as Online Appendix 1, as well as from
ftp://ftp.nj.nec.com/pub/qobi/leonard.tar.Z.
[Key frames shown: Frame 3, Frame 11, Frame 12, Frame 23, Frame 24, Frame 26]

Figure 16: The output of the segmentation-and-tracking and model-reconstruction components
applied to the image sequence from Figure 11, an image sequence that depicts a stack
event.

[Key frames shown: Frame 1, Frame 10, Frame 11, Frame 24, Frame 25, Frame 29]

Figure 17: The output of the segmentation-and-tracking and model-reconstruction components
applied to the image sequence from Figure 12, an image sequence that depicts an unstack
event.

[Key frames shown: Frame 0, Frame 8, Frame 9, Frame 16, Frame 17, Frame 33, Frame 34, Frame 45, Frame 46, Frame 47]

Figure 18: The output of the segmentation-and-tracking and model-reconstruction components
applied to the image sequence from Figure 13, an image sequence that depicts a move
event.

Figure 19: The output of the segmentation-and-tracking and model-reconstruction components
applied to the image sequence from Figure 14, an image sequence that depicts an assemble
event.

Figure 20: The output of the segmentation-and-tracking and model-reconstruction components
applied to the image sequence from Figure 15, an image sequence that depicts a disassemble
event.
(PUT-DOWN MOVING RED BLUE)@{[[0,12],[24,30])}
(STACK MOVING RED BLUE GREEN)@{[[0,12],[24,30])}

(SUPPORTED? MOVING)@{[[13:24])}
(SUPPORTED? RED)@{[[0:30])}
(SUPPORTED? BLUE)@{[[0:30])}
(SUPPORTS? MOVING RED)@{[[0:12])}
(SUPPORTS? RED MOVING)@{[[13:24])}
(SUPPORTS? RED BLUE)@{[[19:20]),[[21:22])}
(SUPPORTS? GREEN MOVING)@{[[19:20]),[[21:22])}
(SUPPORTS? GREEN RED)@{[[19:20]),[[21:22])}
(SUPPORTS? GREEN BLUE)@{[[0:30])}
(SUPPORTS? BLUE MOVING)@{[[13:24])}
(SUPPORTS? BLUE RED)@{[[12:30])}
(CONTACTS? RED BLUE)@{[[12:19]),[[20:21]),[[22:30])}
(CONTACTS? GREEN BLUE)@{[[0:30])}
(ATTACHED? MOVING RED)@{[[0:24])}
(ATTACHED? RED BLUE)@{[[19:20]),[[21:22])}

Figure 21: The output of the event-classification component applied to the model sequence from
Figure 16. Note that the stack event is correctly recognized, as well as the constituent
put down event.
(PICK-UP MOVING RED BLUE)@{[[0,11],[25,33])}
(UNSTACK MOVING RED BLUE GREEN)@{[[0,11],[25,33])}

(SUPPORTED? MOVING)@{[[11:23])}
(SUPPORTED? RED)@{[[0:36])}
(SUPPORTED? BLUE)@{[[0:36])}
(SUPPORTS? MOVING RED)@{[[23:36])}
(SUPPORTS? RED MOVING)@{[[11:23])}
(SUPPORTS? RED BLUE)@{[[13:14])}
(SUPPORTS? GREEN MOVING)@{[[13:14])}
(SUPPORTS? GREEN RED)@{[[13:14])}
(SUPPORTS? GREEN BLUE)@{[[0:36])}
(SUPPORTS? BLUE MOVING)@{[[11:23])}
(SUPPORTS? BLUE RED)@{[[0:25])}
(CONTACTS? MOVING RED)@{[[34:36])}
(CONTACTS? RED BLUE)@{[[0:13]),[[14:24])}
(CONTACTS? GREEN BLUE)@{[[0:13]),[[14:36])}
(ATTACHED? MOVING RED)@{[[11:33])}
(ATTACHED? RED BLUE)@{[[13:14])}
(ATTACHED? GREEN BLUE)@{[[13:14])}

Figure 22: The output of the event-classification component applied to the model sequence from
Figure 17. Note that the unstack event is correctly recognized, as well as the constituent
pick up event.
(PICK-UP MOVING RED GREEN)@{[[0,9],[17,46])}
(PUT-DOWN MOVING RED BLUE)@{[[17,35],[46,52])}
(MOVE MOVING RED GREEN BLUE)@{[[0,9],[46,52])}

(SUPPORTED? MOVING)@{[[9:15])}
(SUPPORTED? RED)@{[[0:52])}
(SUPPORTED? BLUE)@{[[35:46])}
(SUPPORTS? MOVING RED)@{[[17:46])}
(SUPPORTS? MOVING BLUE)@{[[35:46])}
(SUPPORTS? RED MOVING)@{[[9:15])}
(SUPPORTS? RED BLUE)@{[[35:46])}
(SUPPORTS? GREEN MOVING)@{[[9:15])}
(SUPPORTS? GREEN RED)@{[[0:17])}
(SUPPORTS? BLUE RED)@{[[46:52])}
(CONTACTS? RED GREEN)@{[[0:17])}
(CONTACTS? RED BLUE)@{[[46:52])}
(ATTACHED? MOVING RED)@{[[9:46])}
(ATTACHED? RED BLUE)@{[[35:46])}

Figure 23: The output of the event-classification component applied to the model sequence from
Figure 18. Note that the move event is correctly recognized, as well as the constituent
pick up and put down subevents.
(PUT-DOWN MOVING RED GREEN)@{[[57,68],[68,87])}
(PUT-DOWN MOVING GREEN BLUE)@{[[18,35],[41,47])}
(STACK MOVING RED GREEN BLUE)@{[[57,68],[68,87])}
(ASSEMBLE MOVING RED GREEN BLUE)@{[[18,35],[68,87])}

(SUPPORTED? MOVING)@{[[10:18]),[[47:57])}
(SUPPORTED? RED)@{[[57:87])}
(SUPPORTED? GREEN)@{[[11:87])}
(SUPPORTED? BLUE)@{[[35:41])}
(SUPPORTS? MOVING RED)@{[[57:68])}
(SUPPORTS? MOVING GREEN)@{[[11:41])}
(SUPPORTS? MOVING BLUE)@{[[35:41])}
(SUPPORTS? RED MOVING)@{[[10:18]),[[47:57])}
(SUPPORTS? RED GREEN)@{[[11:16])}
(SUPPORTS? GREEN RED)@{[[68:87])}
(SUPPORTS? GREEN BLUE)@{[[35:41])}
(SUPPORTS? BLUE GREEN)@{[[41:87])}
(CONTACTS? RED GREEN)@{[[68:87])}
(CONTACTS? GREEN BLUE)@{[[41:87])}
(ATTACHED? MOVING RED)@{[[11:16]),[[49:68])}
(ATTACHED? MOVING GREEN)@{[[11:41])}
(ATTACHED? GREEN BLUE)@{[[35:41])}

Figure 24: The output of the event-classification component applied to the model sequence from
Figure 19. Note that the assemble event is correctly recognized, as well as the constituent
put down and stack subevents.
(PICK-UP MOVING RED GREEN)@{[[0,19],[23,50])}
(PICK-UP MOVING GREEN BLUE)@{[[22,58],[62,87])}
(UNSTACK MOVING RED GREEN BLUE)@{[[0,19],[23,50])}
(DISASSEMBLE MOVING RED GREEN BLUE)@{[[0,19],[62,87])}

(SUPPORTED? MOVING)@{[[19:22])}
(SUPPORTED? RED)@{[[0:50])}
(SUPPORTED? GREEN)@{[[0:87])}
(SUPPORTED? BLUE)@{[[58:62])}
(SUPPORTS? MOVING RED)@{[[23:50])}
(SUPPORTS? MOVING GREEN)@{[[58:87])}
(SUPPORTS? MOVING BLUE)@{[[58:62])}
(SUPPORTS? RED MOVING)@{[[19:22])}
(SUPPORTS? GREEN MOVING)@{[[19:22])}
(SUPPORTS? GREEN RED)@{[[0:23])}
(SUPPORTS? GREEN BLUE)@{[[58:62])}
(SUPPORTS? BLUE GREEN)@{[[0:58])}
(CONTACTS? RED GREEN)@{[[0:23])}
(CONTACTS? GREEN BLUE)@{[[0:58])}
(ATTACHED? MOVING RED)@{[[19:50])}
(ATTACHED? MOVING GREEN)@{[[58:87])}
(ATTACHED? GREEN BLUE)@{[[58:62])}

Figure 25: The output of the event-classification component applied to the model sequence from
Figure 20. Note that the disassemble event is correctly recognized, as well as the
constituent pick up and unstack subevents.
In Figure 4(a), Frames 0 through 1 correspond to the first subevent of a pick up event, Frames 2
through 13 correspond to the second subevent, and Frames 14 through 22 correspond to the third
subevent. In Figure 4(b), Frames 0 through 13 correspond to the first subevent of a put down event,
Frames 14 through 22 correspond to the second subevent, and Frames 23 through 32 correspond
to the third subevent. Leonard correctly recognizes these as instances of pick up and put down
respectively. In Figure 16, Frames 0 through 11, 12 through 23, and 24 through 30 correspond to
the three subevents of a put down event. Leonard correctly recognizes this as a put down event
and also as a stack event. In Figure 17, Frames 0 through 10, 11 through 24, and 25 through 33
correspond to the three subevents of a pick up event. Leonard correctly recognizes this as a pick
up event and also as an unstack event. In Figure 18, Frames 0 through 8, 9 through 16, and 17
through 45 correspond to the three subevents of a pick up event and Frames 17 through 33, 34
through 45, and 46 through 52 correspond to the three subevents of a put down event. Leonard
correctly recognizes the combination of these two events as a move event. In Figure 19, Frames 18
through 32, 33 through 40, and 41 through 46 correspond to the three subevents of a put down event
and Frames 57 through 67 and 68 through 87 correspond to the first and third subevents of a second
put down event, with the second subevent being empty. The second put down event is also correctly
recognized as a stack event and the combination of these two events is correctly recognized as an
assemble event. In Figure 20, Frames 0 through 18, 19 through 22, and 23 through 50 correspond to
the three subevents of a pick up event and Frames 23 through 56, 57 through 62, and 63 through 87
correspond to the three subevents of a second pick up event. The first pick up event is also correctly
recognized as an unstack event and the combination of these two events is correctly recognized as
a disassemble event. These examples show that Leonard correctly recognizes each of the seven
event types with no false positives.

As discussed in the introduction, using force dynamics and event logic to recognize events offers
several advantages over the prior approach of using motion profile and hidden Markov models:

• robustness against variance in motion profile 

• robustness against presence of extraneous objects in the field of view 

• ability to perform temporal and spatial segmentation of events 

• ability to detect non-occurrence of events 

Figures 26 through 35 illustrate these advantages. Figure 26 shows a pick up event from the left
in contrast to Figure 4(a) which is from the right. Even though these have different motion
profiles, Figure 31 shows that Leonard correctly recognizes that these exhibit the same sequence
of changes in force-dynamic relations and constitute the same event type, namely pick up.
Figure 27 shows a pick up event with two extraneous blocks in the field of view. Figure 32 shows that
Leonard correctly recognizes that these extraneous blocks do not participate in any events and,
despite their presence, the truth conditions for a pick up event still hold between the other objects.
Figure 28 shows a pick up event, followed by a put down event, followed by another pick up event,
followed by another put down event. Figure 33 shows that Leonard correctly recognizes this
sequence of four event occurrences. Figure 29 shows two simultaneous pick up events. Figure 34
shows that Leonard correctly recognizes these two simultaneous event occurrences. Finally,
Figure 30 shows two non-events. Figure 35 shows that Leonard is not fooled into thinking that these
constitute pick up or put down events, even though portions of these events have similar motion
profile to pick up and put down events. Leonard correctly recognizes that these movies do not
match any known event types.

[Key frames shown: Frame 5, Frame 10, Frame 11, Frame 17, Frame 18, Frame 22]

Figure 26: The output of the segmentation-and-tracking and model-reconstruction components on
an image sequence depicting a pick up event from the left instead of from the right.

[Key frames shown: Frame 6, Frame 7, Frame 8, Frame 18, Frame 19, Frame 24]

Figure 27: The output of the segmentation-and-tracking and model-reconstruction components on
an image sequence depicting a pick up event with extraneous objects in the field of view.

An approach to event classification is valid and useful only if it is robust. A preliminary
evaluation of the robustness of Leonard was conducted. Thirty-five movies were filmed, five instances
of each of the seven event types pick up, put down, stack, unstack, move, assemble, and disassemble.
These movies resemble those in Figures 1 and 11 through 15. The same subject performed all
thirty-five events. These movies were processed by Leonard. The results of this preliminary evaluation
are summarized in Table 1. A more extensive evaluation of Leonard will be conducted in the
future.

[Key frames shown: Frame 4, Frame 8, Frame 9, Frame 18, Frame 19, Frame 44, Frame 45, Frame 51, Frame 52, Frame 69, Frame 70, Frame 77, Frame 78, Frame 102, Frame 103, Frame 109, Frame 110, Frame 111]

Figure 28: The output of the segmentation-and-tracking and model-reconstruction components on
an image sequence depicting a sequence of a pick up event, followed by a put down
event, followed by another pick up event, followed by another put down event.

Figure 29: The output of the segmentation-and-tracking and model-reconstruction components on
an image sequence depicting two simultaneous pick up events.



6. Discussion 

This paper presents a new approach to event recognition that differs from the prior approach in 
two ways. First, it uses force dynamics instead of motion profile as the feature set to differentiate 
between event types. Second, it uses event logic instead of hidden Markov models as the compu- 
tational framework for classifying time-series data containing these features. Nominally, these two 
differences are independent. One can imagine using hidden Markov models to classify time series 
of force-dynamic features or using event logic to classify time series of motion-profile features. 
While such combinations are feasible in principle, they are unwieldy in practice. 

Consider using event logic to classify time series of motion-profile features. Motion-profile 
features, such as position, velocity, and acceleration, are typically continuous. A given event usu- 
ally corresponds to a vague range of possible feature values. This vagueness is well modeled by 
continuous-output hidden Markov models. Event logic, which is discrete in nature, requires quantizing
precise feature-value ranges. Such quantization can lead to a high misclassification rate.
Furthermore, continuous distributions allow partitioning a multidimensional feature space into dif- 
ferent classes where the boundaries between classes are more complex than lines along the feature 
axes. Emulating this in event logic would require complex disjunctive expressions. 

Similarly, consider using hidden Markov models to classify time series of force-dynamic fea- 
tures. Suppose that a feature vector contains n features. Since both force-dynamic and motion- 
profile features typically relate pairs of objects, n is often quadratic in the number of event par- 

8. The movies, as well as the results produced by Leonard when processing the movies, are available as Online
Appendix 1, as well as from ftp://ftp.nj.nec.com/pub/qobi/leonard.tar.Z.
[Key frames shown, first sequence: Frame 4, Frame 11, Frame 12, Frame 18, Frame 19, Frame 21; second sequence: Frame 0, Frame 6, Frame 7, Frame 12, Frame 13, Frame 18]

Figure 30: The output of the segmentation-and-tracking and model-reconstruction components
applied to the image sequences from Figure 2, image sequences that depict non-events.
(PICK-UP MOVING RED GREEN)@{[[0,11],[18,30])}

(SUPPORTED? RED)@{[[0:30])}
(SUPPORTED? GREEN)@{[[11:18])}
(SUPPORTS? MOVING RED)@{[[11:30])}
(SUPPORTS? MOVING GREEN)@{[[11:18])}
(SUPPORTS? RED GREEN)@{[[11:18])}
(SUPPORTS? GREEN RED)@{[[0:11])}
(CONTACTS? RED GREEN)@{[[0:11])}
(ATTACHED? MOVING RED)@{[[11:30])}
(ATTACHED? RED GREEN)@{[[11:18])}

Figure 31: The output of the event-classification component applied to the model sequence from
Figure 26. Note that the pick up event is correctly recognized despite the fact that it was
performed from the left instead of from the right.



(PICK-UP MOVING RED GREEN)@{[[0,8],[19,30])}

(SUPPORTED? MOVING)@{[[8:19])}
(SUPPORTED? RED)@{[[0:30])}
(SUPPORTED? BLUE)@{[[0:30])}
(SUPPORTS? MOVING RED)@{[[19:30])}
(SUPPORTS? RED MOVING)@{[[8:19])}
(SUPPORTS? GREEN MOVING)@{[[8:19])}
(SUPPORTS? GREEN RED)@{[[0:19])}
(SUPPORTS? YELLOW BLUE)@{[[0:30])}
(CONTACTS? RED GREEN)@{[[0:10]),[[16:19])}
(CONTACTS? BLUE YELLOW)@{[[0:30])}
(ATTACHED? MOVING RED)@{[[8:30])}
(ATTACHED? RED GREEN)@{[[10:16])}

Figure 32: The output of the event-classification component applied to the model sequence from
Figure 27. Note that the pick up event is correctly recognized despite the presence of
extraneous objects in the field of view.






(PICK-UP MOVING RED GREEN)@{[[52,70],[78,102]),[[0,9],[19,44])}
(PUT-DOWN MOVING RED GREEN)@{[[19,44],[52,70]),[[78,102],[110,117])}

(SUPPORTED? MOVING)@{[[9:18]),[[44:52]),[[70:77]),[[102:110])}
(SUPPORTED? RED)@{[[0:117])}
(SUPPORTS? MOVING RED)@{[[18:44]),[[78:102])}
(SUPPORTS? RED MOVING)@{[[9:18]),[[44:52]),[[70:77]),[[102:110])}
(SUPPORTS? GREEN MOVING)@{[[9:18]),[[44:52]),[[70:77]),[[102:110])}
(SUPPORTS? GREEN RED)@{[[0:19]),[[44:78]),[[102:117])}
(CONTACTS? RED GREEN)@{[[0:9]),[[13:18]),[[46:70]),[[106:117])}
(ATTACHED? MOVING RED)@{[[9:52]),[[70:110])}
(ATTACHED? RED GREEN)@{[[9:13]),[[70:76]),[[104:106])}

Figure 33: The output of the event-classification component applied to the model sequence from
Figure 28. Note that Leonard correctly recognizes a pick up event, followed by a put
down event, followed by another pick up event, followed by another put down event.
(PICK-UP MOVING RED GREEN)@{[[0,6],[16,22])}
(PICK-UP MOVING YELLOW BLUE)@{[[0,12],[17,22])}

(SUPPORTED? MOVING)@{[[6:16])}
(SUPPORTED? MOVING)@{[[12:15])}
(SUPPORTED? RED)@{[[0:22])}
(SUPPORTED? YELLOW)@{[[0:22])}
(SUPPORTS? MOVING RED)@{[[16:22])}
(SUPPORTS? MOVING YELLOW)@{[[17:22])}
(SUPPORTS? RED MOVING)@{[[6:16])}
(SUPPORTS? GREEN MOVING)@{[[6:16])}
(SUPPORTS? GREEN RED)@{[[0:16])}
(SUPPORTS? BLUE MOVING)@{[[12:15])}
(SUPPORTS? BLUE YELLOW)@{[[0:17])}
(SUPPORTS? YELLOW MOVING)@{[[12:15])}
(CONTACTS? RED GREEN)@{[[0:15])}
(CONTACTS? BLUE YELLOW)@{[[0:17])}
(ATTACHED? MOVING RED)@{[[6:22])}
(ATTACHED? MOVING YELLOW)@{[[12:22])}

Figure 34: The output of the event-classification component applied to the model sequence from
Figure 29. Note that the two simultaneous pick up events are correctly recognized.
(SUPPORTED? RED)@{[[0:19])}
(SUPPORTED? MOVING)@{[[13:31])}
(SUPPORTS? RED MOVING)@{[[13:31])}
(SUPPORTS? MOVING RED)@{[[0:13])}
(SUPPORTS? GREEN RED)@{[[12:19])}
(SUPPORTS? GREEN MOVING)@{[[13:19])}
(ATTACHED? RED MOVING)@{[[0:31])}
(ATTACHED? RED GREEN)@{[[13:19])}

(a)

(SUPPORTED? RED)@{[[0:25])}
(SUPPORTED? GREEN)@{[[7:13])}
(SUPPORTS? MOVING RED)@{[[7:13])}
(SUPPORTS? MOVING GREEN)@{[[7:13])}
(SUPPORTS? RED GREEN)@{[[7:13])}
(SUPPORTS? GREEN RED)@{[[0:7]),[[13:25])}
(CONTACTS? RED GREEN)@{[[0:7]),[[13:25])}
(ATTACHED? MOVING RED)@{[[7:13])}
(ATTACHED? RED GREEN)@{[[7:13])}

(b)

Figure 35: The output of the event-classification component applied to the model sequences from
Figure 30. Note that Leonard correctly recognizes that no events occurred in these
sequences.
             pick up  put down  stack  unstack  move  assemble  disassemble

pick up      5/5
put down              5/5
stack                 5/5       5/5
unstack      5/5                       4/5
move         4/5      5/5                       4/5
assemble              5/10      1/5                   1/5
disassemble  10/10                     5/5                      5/5



Table 1: An evaluation of the robustness of Leonard on a test set of five movies of each of seven
event types. The rows represent movies of the indicated event types. The columns represent
classifications of the indicated event type. The entries x/y indicate x, the number of
times that a movie of the indicated event type was classified as the indicated event type,
and y, the number of times that the movie should have been classified as the indicated event
type. Note that stack entails put down, unstack entails pick up, move entails both a pick up
and a put down, assemble entails both a put down and a separate stack, and disassemble
entails both a pick up and a separate unstack. Thus off-diagonal entries are expected in
these cases. There were six false negatives and no false positives. Four of the false
negatives were for the event type assemble. In three of those cases, Leonard successfully
recognized the constituent put down subevent but failed to recognize the constituent stack
subevent as well as the associated put down subevent. In one case, Leonard failed to
recognize both the constituent put down and stack subevents along with the associated put
down constituent of the stack subevent. One of the false negatives was for the event type
move. In this case, Leonard successfully recognized the constituent put down subevent
but failed to recognize the constituent pick up subevent. The remaining false negative was
for the event type unstack. In this case, Leonard successfully recognized the constituent
pick up subevent but failed to recognize the aggregate unstack event.
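
The false-negative count stated in the caption can be checked with a few lines of arithmetic. The sketch below is only an illustration: the placement of each x/y entry in a particular cell is an assumption read off the entailments listed in the caption, and the names TABLE and diagonal_misses are hypothetical, not part of Leonard.

```python
# (recognized, expected) counts keyed by movie type (row), then by
# classification (column). Cell placement is an assumption inferred
# from the entailments described in the caption of Table 1.
TABLE = {
    "pick up": {"pick up": (5, 5)},
    "put down": {"put down": (5, 5)},
    "stack": {"put down": (5, 5), "stack": (5, 5)},
    "unstack": {"pick up": (5, 5), "unstack": (4, 5)},
    "move": {"pick up": (4, 5), "put down": (5, 5), "move": (4, 5)},
    "assemble": {"put down": (5, 10), "stack": (1, 5), "assemble": (1, 5)},
    "disassemble": {"pick up": (10, 10), "unstack": (5, 5), "disassemble": (5, 5)},
}


def diagonal_misses(table):
    """Count, per movie type, the movies whose own event type was missed."""
    misses = {}
    for row, cols in table.items():
        recognized, expected = cols[row]  # diagonal entry of this row
        if expected > recognized:
            misses[row] = expected - recognized
    return misses


misses = diagonal_misses(TABLE)
print(misses)                # {'unstack': 1, 'move': 1, 'assemble': 4}
print(sum(misses.values()))  # 6
```

Under this reading, the six false negatives are the shortfalls on the diagonal: four for assemble, one for move, and one for unstack. The off-diagonal shortfalls (for example 5/10 put down subevents in the assemble movies) are consequences of those same six movies rather than additional failures.
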
ticipants. Let us contrast the number of parameters needed to represent these features in both the
motion-profile and force-dynamic approaches. Since, as discussed above, motion-profile features
are typically continuous, hidden Markov models with continuous outputs can be used in the
motion-profile approach. When the features are independent, such a model requires O(n) parameters per
state to specify the output distributions. Even if one uses, say, a multivariate Gaussian to model
dependent features, this requires only O(n^2) parameters per state to specify the output distributions
in the motion-profile approach. However, force-dynamic features are Boolean. This requires using
discrete-output hidden Markov models. Such models output a stream of symbols, not feature
vectors. Constructing an appropriate alphabet of output symbols requires considering all possible
subsets of features. This requires O(2^n) parameters per state to specify the output distributions in
the force-dynamic approach. Thus continuous-output hidden Markov models appear to be better
suited to an approach that uses motion-profile features while event logic appears to be better suited
to an approach that uses force-dynamic features.
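
To make the growth rates above concrete, the following sketch counts per-state output parameters for the three cases. The exact constants are assumptions chosen for illustration (one mean and one variance per independent feature, a mean vector plus a symmetric covariance matrix for the multivariate Gaussian, and one free probability per feature subset for the discrete model); the text itself only gives the asymptotic orders.

```python
def params_independent_gaussian(n: int) -> int:
    """One mean and one variance per feature: O(n)."""
    return 2 * n


def params_full_gaussian(n: int) -> int:
    """n means plus n(n+1)/2 distinct covariance entries: O(n^2)."""
    return n + n * (n + 1) // 2


def params_discrete_subsets(n: int) -> int:
    """One probability per subset of n Boolean features, minus one
    because the probabilities must sum to 1: O(2^n)."""
    return 2 ** n - 1


if __name__ == "__main__":
    print("n  independent  full-covariance  discrete")
    for n in (2, 5, 10, 20):
        print(n,
              params_independent_gaussian(n),
              params_full_gaussian(n),
              params_discrete_subsets(n))
```

With ten features, these assumed counts are 20, 65, and 1023 parameters per state respectively, which shows how quickly the discrete-output alphabet outgrows the continuous alternatives.
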

Humans use language for three fundamental purposes: we describe what we see, we ask others 
to perform actions, and we engage in conversation. The first two require grounding language in 
perception and action. Only the third involves disembodied use of language. Almost all research 
in computational linguistics has focused on such disembodied language use. Data-base query pro- 
cessing, information extraction and retrieval, and spoken-language dialog all use language solely to 
manipulate internal representations. In contrast, the work described in this paper grounds language 
in perception of the external world. It describes an implemented system, called Leonard, that uses 
language to describe events observed in short image sequences. 

Why is perceptual grounding of language important and relevant to computational linguistics? 
Current approaches to lexical semantics suffer from the 'bold-face syndrome.' All too often, the 
meanings of words, like throw, are taken to be uninterpreted symbols, like throw, or expressions 
over uninterpreted symbols, like cause to go (Leech, 1969; Miller, 1972; Schank, 1973; Jackendoff,
1983, 1990; Pinker, 1989). Since the interpretation of such symbols is left to informal intuition, the 
correctness of any meaning representation constructed from such uninterpreted symbols cannot be 
verified. In other words, how is one to know whether cause to go is the correct meaning of throw? 
Perceptual grounding offers a way to verify semantic representations. Having an implemented sys- 
tem use a collection of semantic representations to generate appropriate descriptions of observations 
gives evidence that those semantic representations are correct. This paper takes a small step in this 
direction. In contrast to prior work, which presents informal semantic representations whose in- 
terpretation is left to intuition, it presents perceptually-grounded semantic representations. While 
the system described in this paper addresses only perceptual grounding of language, the long-term 
goal of this research is to provide a unified semantic representation that is sufficiently powerful to 
support all three forms of language use: perception, action, and conversation. 

Different parts of speech in language typically describe different aspects of visual percepts. 
Nouns typically describe objects. Verbs typically describe events. Adjectives typically describe 
properties. Prepositions typically describe spatial and temporal relations. Grounding language in 
visual perception will require construction of semantic representations for all of these different parts 
of speech. It is likely that different parts of speech will require different machinery to represent their 
lexical semantics. In other words, whatever the ultimate representations of apple and chair are, they
are likely to be based on very different principles than the ultimate representations of pick up and put
down. These, in turn, are likely to be further different from those needed to represent in, on, red, and 
big. Indeed, machine vision research, at least that aspect of machine vision research that focuses 
on object recognition, can be viewed as an attempt to perceptually ground the lexical semantics of
nouns. In contrast, this paper focuses solely on verbs. Accordingly, it develops machinery that is
very different from what is typically used in the machine-vision community, machinery that is more
reminiscent of that which is used in the knowledge-representation community. On the other hand, 
unlike typical knowledge-representation work, it grounds that machinery in image processing. 

When one proposes a representation, such as cause to go, as the meaning of a word, such 
as throw, one must specify three things to effectively specify the meaning of that word. First, 
one must specify the lexical semantics of the individual primitives, how one determines the truth 
conditions of items like cause and to go. Second, one must specify the compositional semantics of 
the representation, how one combines the truth conditions of primitives like cause and to go to get 
the aggregate truth conditions of compound expressions like cause to go. Third, one must specify a
lexical entry, a map from a word, like throw, to a compound expression, like cause to go. All three
are necessary in order to precisely specify the word meaning. 

Prior work in lexical semantics, such as the work of Leech (1969), Miller (1972), Schank (1973), 
Jackendoff (1983, 1990), and Pinker (1989), is deficient in this regard. It specifies the third com- 
ponent without the first two. In other words, it formulates lexical entries in terms of compound 
expressions like cause to go, without specifying the meanings of the primitives, like cause and to 
go, and without specifying how these meanings are combined to form the aggregate meaning of the 
compound expression. This paper attempts to address that deficiency by specifying all three com- 
ponents. First, the lexical semantics of the event-logic primitives is precisely specified in Figure 9. 
Second, the compositional semantics of event logic is precisely specified in Section 3. Third, lexi- 
cal entries for several verbs are precisely specified in Figure 10. These three components together 
formally specify the meanings of those verbs with a level of precision that is absent in prior work. 

While these lexical entries are precise, there is no claim that they are accurate. Lexical entries 
are precise when their meaning is reduced to an impartial mechanical procedure. Lexical entries 
are accurate when they properly reflect the truth conditions for the words that they define. Even 
ignoring homonymy and metaphor, words such as move and assemble clearly have meanings that 
are much more complex than what is, and even can be, represented with the machinery presented 
in this paper. But that holds true of prior work as well. The lexical entries given in, for example,
Leech (1969), Miller (1972), Schank (1973), Jackendoff (1983, 1990), and Pinker (1989) also do 
not accurately reflect the truth conditions for the words that they define. The purpose of this paper is 
not to improve the accuracy of definitions. In fact, the definitions given in prior work might be more 
accurate, in some ways, than those given here. Rather, its purpose is to improve the precision of 
definitions. The definitions given in prior work are imprecise and that imprecision makes assessing 
their accuracy a subjective process: do humans think an informally specified representation matches 
their intuition. In contrast, precision allows objective assessment of accuracy: does the output of a 
mechanical procedure applied to sample event occurrences match human judgments of which words
characterize those occurrences. 

Precision is the key methodological advance of this work. Precise specification of the meaning 
of lexical semantic representations, by way of perceptual grounding, makes accuracy assessment 
possible by way of experimental evaluation. Taking this first step of advancing precision and per- 
ceptual grounding will hopefully allow us to take future steps towards improving accuracy. 
7. Related Work

Most prior work uses motion profile, some combination of relative-and-absolute linear-and-angular
positions, velocities, and accelerations, as the features that drive event classification. That work fol- 
lows the tradition of linguists and cognitive scientists, such as Leech (1969), Miller (1972), Schank 
(1973), Jackendoff (1983, 1990), and Pinker (1989), that represent the lexical semantics of verbs 
via the causal, aspectual, and directional qualities of motion. Some linguists and cognitive scien- 
tists, such as Herskovits (1986) and Jackendoff and Landau (1991), have argued that force-dynamic 
relations (Talmy, 1988), such as support, contact, and attachment, are crucial for representing the 
lexical semantics of spatial prepositions. For example, in some situations, part of what it means for 
one object to be on another object is for the former to be in contact with, and supported by, the latter. 
In other situations, something can be on something else by way of attachment, as in the knob on the 
door. Siskind (1992) has argued that change in the state of force-dynamic relations plays a more 
central role in specifying the lexical semantics of simple spatial motion verbs than motion profile. 
The particular relative-and-absolute linear-and-angular positions, velocities, and accelerations don't
matter when picking something up or putting something down. What matters is a state change in 
the source of support of the patient. Similarly, what distinguishes putting something down from 
dropping it is that, in the former, the patient is always supported, while in the latter, the patient 
undergoes unsupported motion. 
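
The idea that these verbs are distinguished by a state change in the patient's source of support, rather than by its trajectory, can be sketched in a few lines. This is a deliberately simplified illustration and not Leonard's event-logic definitions: it assumes that each frame has already been reduced to a single label naming what supports the patient (the hypothetical constant HAND, another object's name, or None for unsupported), and the function classify is likewise hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

HAND = "hand"  # hypothetical label for the agent's hand


def classify(support: Sequence[Optional[str]]) -> List[Tuple[int, str]]:
    """Label each change in the patient's source of support.

    support[t] names what supports the patient in frame t, or is None
    when the patient is unsupported. Returns (frame, event) pairs.
    """
    events = []
    for t in range(1, len(support)):
        before, after = support[t - 1], support[t]
        if before == after:
            continue  # no state change, so no event
        if before is not None and before != HAND and after == HAND:
            events.append((t, "pick up"))   # object support -> hand support
        elif before == HAND and after is not None:
            events.append((t, "put down"))  # hand support -> object support
        elif before == HAND and after is None:
            events.append((t, "drop"))      # hand support -> unsupported motion
    return events


# The red block rests on the green block, then is held by the hand.
print(classify(["green"] * 13 + [HAND] * 7))               # [(13, 'pick up')]
# The patient is released and falls before landing on the table.
print(classify([HAND] * 5 + [None] * 3 + ["table"] * 4))   # [(5, 'drop')]
```

Note that no positions, velocities, or accelerations appear anywhere in the sketch: a put down and a drop may end with the patient in the same place, and only the presence of an unsupported interval separates them.
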

The work described in this paper differs from prior work in visual-event perception in a num- 
ber of respects. Waltz and Boggess (1979), Waltz (1981), Marr and Vaina (1982), and Rubin 
and Richards (1985) describe unimplemented frameworks that are not based on force dynamics. 
Thibadeau (1986) describes a system that recognizes when an event occurs but not what event 
occurs. His system processes simulated video and is not based on force dynamics. Badler (1975), 
Adler (1977), Tsuji, Morizono, and Kuroda (1977), Okada (1979), Tsuji, Osada, and Yachida (1979, 
1980), Abe, Soga, and Tsuji (1981), Abe and Tsuji (1982), Novak and Bulko (1990), and Regier 
(1992) describe systems that process simulated video and that are not based on force dynamics. 
Borchardt (1984, 1985) presents event definitions that are based on force-dynamic relations but 
does not present techniques for recovering those relations automatically from either simulated or 
real video. Yamoto et al. (1992), Starner (1995), Siskind and Morris (1996), Siskind (1996), Brand
(1996, 1997a), Brand, Oliver, and Pentland (1997), and Bobick and Ivanov (1998) present systems 
that recognize event occurrences from real video using motion profile but not force dynamics. These 
systems use hidden Markov models rather than event logic as the event-classification engine. Funt 
(1980) presents a heuristic approach to stability analysis that operates on simulated video but does 
not perform model reconstruction or event classification. Brand, Birnbaum, and Cooper (1993) and 
Brand (1997b) present a heuristic approach to stability analysis that operates on real video but do not 
use stability analysis to perform model reconstruction and event classification. Blum, Griffith, and 
Neumann (1970) and Fahlman (1974) present stability-analysis algorithms that are based on linear 
programming but do not use stability analysis to perform model reconstruction or event classifica- 
tion. These stability-analysis algorithms use dynamics rather than kinematics. Siskind (1991, 1992, 
1993, 1994, 1995, 1997) presents systems that operate on simulated video and use force dynamics 
to recognize event occurrences. All of that work, except Siskind (1997), uses heuristic approaches 
to stability analysis, model reconstruction, and event classification. Siskind (1997) presents an early 
version of the stability-analysis and event-logic-based event-recognition techniques used in the cur- 
rent system. Mann et al. (1996, 1997) and Mann and Jepson (1998) present a system that does 
model reconstruction from real video but does not use the recovered force-dynamic relations to
perform event classification. That work uses an approach to stability analysis based on dynamics 
instead of the kinematic approach used in this paper. 

There is also a body of prior work that grounds fragments of natural-language semantics in 
physical relations between objects in graphically represented blocks worlds or for solving physics 
word problems. Examples of such work include Bobrow (1964), Winograd (1972), and Palmer 
(1990) as well as the ISAAC system (Novak, 1976) and the Mecho project (Bundy, Luger, Palmer,
& Welham, 1998; Bundy, Byrd, Luger, Mellish, Milne, & Palmer, 1979; Luger, 1981). While
that work does not focus on recognizing events, per se, it does relate lexical semantics to physical 
relations between represented objects. 

Leonard currently does not contain a learning component. It is given a fixed physical theory 
of the world, implicitly represented in the model-reconstruction procedure, and a fixed collection 
of event-type descriptions, explicitly formulated as event-logic expressions. One potential area for
future work would be to automatically learn a physical theory of the world and/or event-type descrip- 
tions. Adding a learning component could potentially produce more robust model-reconstruction 
and event-classification components than those currently constructed by hand. Techniques such as 
those presented in Martin and Geffner (2000) and Cumby and Roth (2000) might be useful for this 
task. 

8. Conclusion 

This paper has presented Leonard, a comprehensive implemented system for recovering event 
occurrences from video input. It differs from the prior approach to the same problem in two funda- 
mental ways. First, it uses state changes in the force-dynamic relations between objects, instead of 
motion profile, as the key descriptive element in defining event types. Second, it uses event logic, 
instead of hidden Markov models, to perform event classification. One key result of this paper is 
the formulation of spanning intervals, a novel efficient representation of the infinite sets of intervals 
that arise when processing liquid and semi-liquid events. A second key result of this paper is the
formulation of an efficient procedure, based on spanning intervals, for inferring all occurrences of 
compound event types from occurrences of primitive event types. The techniques of force-dynamic 
model reconstruction, spanning intervals, and event-logic inference have been used to successfully 
recognize seven event types from real video: pick up, put down, stack, unstack, move, assemble,
and disassemble. Using force dynamics and event logic to perform event recognition offers four 
key advantages over the prior approach of using motion profile and hidden Markov models. First, 
it is insensitive to variance in the motion profile of an event occurrence. Second, it is insensitive to 
the presence of extraneous objects in the field of view. Third, it allows temporal segmentation of 
sequential and parallel event occurrences. Fourth, it robustly detects the non-occurrence of events 
as well as their occurrence. 

At a more fundamental level, this paper advances a novel methodology: grounding lexical- 
semantic representations in visual-event perception as a means for assessing the accuracy of such 
representations. Prior work in lexical-semantic representations has used calculi whose semantics 
were not precisely specified. Lexical entries formulated in such calculi derived their meaning from
intuition and thus could not be empirically tested. By providing a lexical-semantic representation 
whose semantics is precisely specified via perceptual grounding, this paper opens up the field of 
lexical semantics to empirical evaluation. The particular representations advanced in this paper 
are clearly only approximations to the ultimate truth. This follows from the primitive state of our
understanding of language and perception. Nonetheless, I hope that this paper offers an advance 
towards the ultimate truth, both through its novel methodology and the particular details of its 
mechanisms. 